Disaster in the Datacentre
It’s happened to everyone in Ops or DevOps at some time: after repeated warnings that your current hardware can’t handle its projected load, someone higher up releases the cash for an upgrade. Then, just as you’re provisioning the replacement, disaster strikes and down goes the existing production system.
It happened to me while I was working on upgrading the RADIUS infrastructure for a major public-access WiFi network. Millions of live hotspots with thousands of dollars in revenue and tens of thousands of RADIUS transactions every minute, handled by a redundant pair of RADIUS servers for years without a major outage…and then bang! Down it goes.
Not a total disaster, as they were both still serving requests, but the replication between them had died and they were intermittently failing to record some of the transactions they served. It didn’t go on for long, perhaps an hour, but by that time we had inconsistent and duplicate accounting records for that period which represented potentially tens of thousands of dollars in revenue…a complete mess of millions of transaction records that we were utterly unable to bill for.
Tcl to the Rescue!
By luck I’d only gotten as far as installing the OS on the new hardware, a nice fat multi-CPU system destined to become a hypervisor for a multitude of virtual hosts. It had 24 virtual CPUs, 48 GB of RAM and a couple of terabytes of fast storage; all I had to do was figure out a way to put it all to work. Tcl to the rescue!
Lots of small parts of the overall system had Tcl in them somewhere. It was embedded in the hardware and in some of the software we were using, so I was pretty familiar with it and had built a few tools with it for analysing some of our data. I had some routines that could de-duplicate data records, but I needed to analyse hundreds of logs in different formats from different places around the network to produce a coherent view of what had happened during the outage.
All the data I needed was there; I just had to stitch it together somehow. I could have run multiple passes over the data, building up a reliable view of the accounting records we needed pass by pass, but that could have taken weeks. What I really needed to do was analyse everything at once.
The Simplicity of Tcl
Tcl is known for being wonderfully simple, even simplistic to some. It was this simplicity, as applied to its threading model, that made it appealing for this particular task: one interpreter per thread, jobs sent from one thread to another as scripts, and results reported back asynchronously for Tcl’s event loop to handle. I built a script that spun up a thread pool of 22 worker threads, with a master thread to collate the results and direct the workers. Workers were despatched to analyse a log file and report back on the sessions or session fragments they’d found; the master thread would then store these in memory (remember, I had plenty!) and direct the next available threads to go looking for more evidence of those sessions in other logs. This process continued until all the sessions that could be reliably accounted for had been found, at which point the master thread simply dumped them all as a unified log suitable for input to the billing system.
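For anyone who wants to try the pattern, here is a minimal sketch of the fan-out-and-collate step using the tpool commands from Tcl’s Thread package. It is not the original script. The procedure names (scanLog, collate) and the one-record-per-line "sessionId rest-of-record" log format are assumptions made purely for illustration, and it performs a single round of scanning rather than the repeated "go and look for more evidence" rounds described above.

```tcl
package require Thread

# Script run once in every worker thread. Each worker has its own
# interpreter, so the (hypothetical) scanLog proc must be defined there.
set workerInit {
    proc scanLog {path} {
        set fragments {}
        set fh [open $path r]
        while {[gets $fh line] >= 0} {
            # Assumed format: "<sessionId> <rest of record>"
            if {[regexp {^(\S+)\s+(.*)$} $line -> id rest]} {
                lappend fragments $id $rest
            }
        }
        close $fh
        return $fragments
    }
}

# Master side: post one job per log file, then collate the results
# into a dict keyed by session id as the workers finish.
proc collate {logFiles {workers 22}} {
    set pool [tpool::create -minworkers $workers -maxworkers $workers \
                  -initcmd $::workerInit]
    set pending {}
    foreach path $logFiles {
        # Jobs are sent to the pool as scripts.
        lappend pending [tpool::post $pool [list scanLog $path]]
    }
    set sessions [dict create]
    while {[llength $pending] > 0} {
        # Blocks until at least one job is done; the still-pending
        # job ids are written back into the 'pending' variable.
        set done [tpool::wait $pool $pending pending]
        foreach job $done {
            foreach {id fragment} [tpool::get $pool $job] {
                dict lappend sessions $id $fragment
            }
        }
    }
    tpool::release $pool
    return $sessions
}
```

Calling collate with a list of log file paths returns a dict mapping each session id to the list of fragments found for it. A fuller version would inspect that dict after each round and post further jobs for whichever logs still looked relevant.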
Frankly, I still thought this process was likely to take hours, but it completed in a matter of minutes. It turned out that this approach not only made the best use of the hardware, maxing out the CPUs and using a good proportion of the memory, but also, because it maintained a complete view of the sessions involved throughout the process, could make more informed decisions about which logs were actually relevant and ignore those that weren’t. It managed to recover over 95% of the sessions we would have expected to see in that period and only needed to scan around 20% of the logs. In a couple of days’ worth of hacking, admittedly with the help of some fairly big hardware, I’d managed to save potentially weeks of error-prone processing. All this was completed in time for the normal billing cycle thanks to Tcl’s ability to quickly scale out to deal with big problems.
Tcl’s innate asynchronicity combined with the simplicity of its threading model makes it ideal for processing data from multiple sources concurrently. With the trend for adding more CPUs per die, now that the race for ever-higher clock speeds is pretty much over, Tcl offers a simple path to taking advantage of these concurrent resources beyond simply partitioning them up for a multitude of virtual machines. Tcl is simple enough that it’s used by a raft of folk who would never refer to themselves as programmers, and the threading model is simple enough that it can open up the power of modern hardware to a very wide audience.
Title image courtesy of Steven Depolo on Flickr under Creative Commons License.