If software eats the world, you better use version control!


As I recently observed, Corporate IT tends to be afraid of code: code is where all those pesky bugs come from that have to be fixed by quirky and unreliable software developers or expensive external consultants. This attitude plays to the favor of vendors who peddle products that "only" require configuration, as opposed to coding. Sadly, as we have known since at least 2005, most configuration is really just programming in a poorly designed, rather constrained language without decent tooling or useful documentation. Being afraid of code and being unfamiliar with the modern development lifecycle is a dangerous proposition in a world where everything is becoming defined by software.

SDX – Software-defined Anything

Much of traditional IT infrastructure is either hard-wired or semi-manually configured: servers are racked and cabled, network switches are configured manually with tools or config files. Operations staff is often quite happy with this state of affairs because it keeps the programmer types away from the critical infrastructure where the last thing we need is bugs and stuff like "agile" development, which is still widely (mis-)interpreted as just doing something random and hoping for the best.

This is rapidly changing, though, and that's a good thing. The continuing virtualization of infrastructure makes resources that were once shipped by truck or wired by hand available via a call to an OpenStack API. It's like going from haggling in a car dealership and waiting 4 months for delivery just to find out that you should have gotten the better seats after all to locating a Zipcar / DriveNow from your mobile phone and driving off 3 minutes later. Virtualized infrastructure is an essential element to keeping up with the scalability and evolution demands of digital applications. You can't build an agile business model when it takes you 4 weeks to get a server and 4 months to get it on the right network segment.

Operating-system level virtualization is by no means a new invention, but the "software-defined" trend has extended to SDN, Software-defined Networks, and full-blown SDDC, Software Defined Data Centers. If that isn't enough, you can opt for SDx – Software Defined Anything, which includes virtualization of compute, storage, network and whatever else can be found in a data center, hopefully in some coordinated manner.

As so often, it's easy to look into the future of IT by reading those Google research papers that describe their systems of 5+ years ago (side note: finally there is an official paper on Borg, Google's cluster manager). To get a glimpse of where SDN is headed, look at what Google has done with the so-called Jupiter Network Architecture (or jump straight to the SIGCOMM Paper). If you are too busy to read the whole thing, this three-liner will do to get you excited: "Our latest-generation Jupiter network […] delivering more than 1 petabit/sec of total bisection bandwidth. This means that each of 100,000 servers can communicate with one another in an arbitrary pattern at 10Gb/s." This can only be achieved by having a network infrastructure that can be configured based on the applications' needs and is considered as an integral part of the overall infrastructure virtualization.

Automate Everything!

I once shocked a large portion of our IT staff by defining my strategy as: "automate everything and make those parts that can't be automated a self-service". The reaction ranged from confusion and disbelief to anger. Still, this is exactly what Amazon & Co have done. And they have revolutionized how people access IT infrastructure. If corporate IT infrastructure wants to remain competitive, this is the way corporate IT has got to be thinking!

In a recent meeting one of our vendor's architects stated that automation should not be implemented for infrequently performed tasks because it isn't economically viable. Basically they calculated that writing the automation would take more hours than would ever be spent completing the task manually (they also appear to be on a fixed-price contract). I challenged this reasoning with the argument of repeatability and traceability: wherever humans are involved, mistakes are bound to happen and work will get performed ad-hoc without proper documentation. The error rate is likely the highest for infrequently performed tasks as the operators lack the routine. Of course, the whole argument collapses when you think about disaster recovery scenarios: one hopes that they occur infrequently, but when they happen you better be fully automated. Otherwise you may as well argue that the fire brigade should use a bucket chain as all those fire trucks and pumps are not economically viable given how few fires we have.

Computers are much better at executing repetitive tasks while we humans are better at coming up with new ideas, designing things or automating stuff. We should stick to this separation of duty and let the machines do the repeatable tasks without fearing that Skynet will take over the world any moment.

The Loomers’ Riot?

New tools necessitate a new way of thinking, though, to be useful. It's the old "a fool with a tool is still a fool". I actually don't like this saying because you don't have to be a fool to be unfamiliar with a new tool and a new way of thinking. For example, many folks in infrastructure and operations are far detached from the way contemporary software development is done. This doesn't make them fools in any way, but it prevents them from migrating into the "software-defined" world. They may have never heard of unit tests or build pipelines and may have been led to believe that "agile" is a synonym for "haphazard". They have not had enough time to conclude that immutability is an essential property and that rebuilding / regenerating something from scratch beats making incremental changes.

As a result, despite often being the bottleneck in an IT ecosystem that demands ever faster changes and innovation cycles, operations teams are often not ready to hand over their domain to the "application folk" who can script the heck out of the software-defined anything. One could posit that such behavior is akin to the Loomer Riots because the economic benefits of a software-defined infrastructure are too strong for anyone to put a stop to it. At the same time, it's important to get those folks on board who keep the lights on and understand the existing systems the best, so we can't ignore this dilemma.

Explaining to everyone What is Code? is certainly a step in the right direction. As I often say: "if software eats the world, there will be only 2 kinds of people: those who tell the machines what to do and those where it's the opposite." Having more senior management role models who can code would be another good step. However, living successfully in a software-defined world is not a simple matter of learning programming or scripting.

Undoing by Redoing

A great example of how people have to start thinking differently was provided by the same vendor of ours when I brought up the argument of reversibility: if a configuration is not working, we need to quickly revert to the last known stable state to minimize the recovery time. If you have made manual updates, this is very difficult at best. In a software-defined, automated world this becomes much easier. The vendor countered that this would require an explicit "undo" script for each possible action, which would make it even more expensive or cumbersome to develop the automation.

Think like a Software Developer!

This response highlights that many infrastructure teams don't yet think like software developers. Software developers who have been using software tools and processes for quite some time know that if they can build an artifact such as a binary or a piece of configuration from scratch using an automated build system, they can easily revert to a previous version by resetting version control to the last known good state, rebuilding from scratch and republishing this "undone" configuration. Software build processes don't know "undo", they simply "redo".

In complex software projects this is a quite normal procedure, often instigated by the so-called "build cop" after the build goes "red" because of failing tests. The build cop will try to get the developer who checked in the offending code to fix it quickly or will simply revert that code submission. Modern configuration automation tools have a similar ability to regain a known stable state, so one can apply the same process of reverting and automatically re-configuring.
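The "redo instead of undo" idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `ConfigRepo` is a hypothetical in-memory stand-in for a real version control system, and `build` stands in for whatever automated build or provisioning step you actually use. The point is that no action-specific "undo" script appears anywhere.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigRepo:
    """Toy stand-in for version control: an append-only list of revisions."""
    revisions: list = field(default_factory=list)

    def commit(self, source: dict) -> int:
        # Store a copy so later edits cannot alter history.
        self.revisions.append(dict(source))
        return len(self.revisions) - 1

    def checkout(self, revision: int) -> dict:
        return dict(self.revisions[revision])


def build(source: dict) -> str:
    """Regenerate the full artifact from source; never patch the old one."""
    return "\n".join(f"{key}={source[key]}" for key in sorted(source))


def rollback(repo: ConfigRepo, last_known_good: int) -> str:
    """'Undo' is a 'redo' of an earlier revision: check out and rebuild."""
    return build(repo.checkout(last_known_good))


repo = ConfigRepo()
good = repo.commit({"port": 8080, "workers": 4})
bad = repo.commit({"port": 8080, "workers": 0})  # the offending change

deployed = build(repo.checkout(bad))   # build goes "red"
deployed = rollback(repo, good)        # rebuild from the last known good state
```

Because `rollback` reuses the very same `build` function, reverting costs nothing extra to develop: it is the normal build, pointed at an older revision.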

Use a Build Pipeline

The concept of "no snowflake" or "no pet" servers underlines the spirit that in the software-defined world a server (or network) configuration can be recreated automatically with ease, similar to re-creating a build artifact such as a jar file: when infrastructure is software-defined, you can provision or configure it in the same manner you would create a software artifact. You no longer have to be afraid to mess up a specific instance of a server because you can easily recreate that specific configuration via software in minutes.

Just like continuous integration can rebuild a software system frequently and predictably, software defined infrastructure is not just about replacing hardware configuration with software but primarily about adopting a rigorous development lifecycle based around disciplined development, automated testing, and continuous integration.

Automated Quality Checks

In another Google example from 8 years ago, one of the most critical pieces of request routing infrastructure was configured via a file consisting of hundreds of regular expressions. Of course, this file was under version control as it should be. Even though proposed changes had to be reviewed by an owner of this specific code branch, invariably someday someone checked in a misconfiguration, which brought down most of Google's services because the relative URL path of Google's site was no longer routed to the corresponding service instance. But the answer was not to disallow changes to this file as that would have slowed things down. Instead, two things happened. First, the problem was quickly undone as the file was under version control and the previous version was available. Second, additional automatic checks were implemented in the code submit pipeline to make sure that syntax errors or conflicting regular expressions are detected before the file is checked into the code repository, not when it is pushed to production. When working with software-defined infrastructure you need to work like you would in professional software development.
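A pre-submit check of this kind might look roughly like the following Python sketch. It is not Google's actual check, which I can't reproduce here; the function name `check_routes` and the idea of probing with a list of sample paths are assumptions made for illustration. Probing with samples is also only an approximation: it catches conflicts for the paths you list, not every possible overlap between two expressions.

```python
import re


def check_routes(patterns, sample_paths):
    """Return a list of problems; an empty list means the file may be submitted."""
    problems = []
    compiled = []
    for pattern in patterns:
        try:
            compiled.append((pattern, re.compile(pattern)))
        except re.error as exc:
            # Syntax errors are caught here, long before production.
            problems.append(f"syntax error in {pattern!r}: {exc}")

    for path in sample_paths:
        # Every sample path should be claimed by exactly one route.
        matches = [p for p, rx in compiled if rx.fullmatch(path)]
        if not matches:
            problems.append(f"no route for {path!r}")
        elif len(matches) > 1:
            problems.append(f"conflicting routes for {path!r}: {matches}")
    return problems


clean = check_routes(
    [r"/search(/.*)?", r"/mail(/.*)?"],
    ["/search", "/mail/inbox"],
)
broken = check_routes(
    [r"/search(", r"/.*", r"/mail(/.*)?"],
    ["/mail/inbox", "/"],
)
```

Here `clean` comes back empty, while `broken` reports one syntax error and one conflict. A submit pipeline would reject the change whenever the list is non-empty.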

Proper Language

One curiosity about Google is that no one working there ever used buzzwords like "Big data", "Cloud", or "Software-defined data center" because Google had all these things well before the buzzwords were created by industry analysts: much of Google's infrastructure was software-defined 10 years ago. One aspect of deploying large numbers of tasks onto a data center infrastructure is the configuration of the individual process instances: each instance is likely to require small variations in configuration. For example, front-ends 1 through 4 may connect to back-end A while front-ends 5-7 connect to back-end B. Maintaining individual config files for each instance would be cumbersome and error-prone, especially as you scale up and down. Instead, each configuration is generated and passed to the process via the command line. Google avoided the trap of configuration files and decided to use a well-defined functional language called BCL, which supports templates and value inheritance. It also provides built-in functions like map() that are convenient for manipulating lists of values for such configurations.
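To make the idea of generated configuration concrete, here is a small sketch in Python. It is emphatically not BCL syntax; it only mimics the concepts named above (a shared template, value inheritance, and map()) using the front-end example from the text. All names, such as `BASE`, `inherit`, and the flag format, are made up for illustration.

```python
# Shared defaults, playing the role of a template.
BASE = {"port": 8080, "threads": 16}


def inherit(parent: dict, **overrides) -> dict:
    """Value inheritance: the child keeps parent values unless overridden."""
    return {**parent, **overrides}


def backend_for(index: int) -> str:
    # Mirrors the example in the text: front-ends 1-4 use A, 5-7 use B.
    return "backend-a" if index <= 4 else "backend-b"


def to_flags(config: dict) -> list:
    """Render a config as command-line flags in a stable order."""
    return [f"--{key}={config[key]}" for key in sorted(config)]


# One generated configuration per instance; no per-instance files to maintain.
frontends = {
    i: inherit(BASE, name=f"frontend-{i}", backend=backend_for(i))
    for i in range(1, 8)
}
command_lines = list(map(to_flags, frontends.values()))
```

Scaling from 7 to 70 front-ends means changing a range, not editing 63 more files, and changing a default in `BASE` reaches every instance at once.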

Learning a custom functional language to describe deployment descriptors for jobs may not be everyone's cup of tea but it highlights what a real software-defined system looks like. As the language BCL was widely used and configuration programs got bigger and more complex, debugging became an issue. As a result, folks started to write an expression evaluator and unit testing tools around the configuration language. That's what software people would do to solve the problem: solve software problems with software!
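Once configuration is a program, it can be unit tested like one. The sketch below uses Python's standard `unittest` module rather than the tools built around BCL, which I can't show here; `generate` is a hypothetical configuration program that assigns front-ends to back-ends.

```python
import unittest


def generate(count: int, split_after: int) -> dict:
    """Hypothetical config program: map front-end index to back-end name."""
    return {i: ("A" if i <= split_after else "B") for i in range(1, count + 1)}


class GenerateTest(unittest.TestCase):
    def test_every_frontend_has_a_backend(self):
        config = generate(count=7, split_after=4)
        self.assertEqual(sorted(config), list(range(1, 8)))

    def test_split_matches_intent(self):
        config = generate(count=7, split_after=4)
        self.assertEqual([config[i] for i in (1, 4)], ["A", "A"])
        self.assertEqual([config[i] for i in (5, 7)], ["B", "B"])


def run_tests() -> bool:
    """Run the suite programmatically, as a submit pipeline might."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(GenerateTest)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

Tests like these run before a configuration ever reaches a live system, which is exactly where you want an off-by-one in the front-end split to surface.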

Software eats the world, one revision at a time

There's much more to being software-defined than a few scripts and config files. Rather, it's about making infrastructure part of your software development lifecycle (SDLC). First, make sure your SDLC is fast but disciplined and automated but quality-oriented. Second, apply the same thinking to your software-defined infrastructure. Or else you may end up with SDA, Software-Defined Armageddon.