Smart Grids, Part I

“Smart grid,” as a term, evokes different things for different people.

A computing grid federates resources, which means that it takes a number of disparate pieces (multiple disk drives, multiple CPUs, or multiple network interfaces, or any combination of these) and combines them to present a single apparent interface for others to use.

As an example, Gluster takes multiple filesystems and presents them as a single filesystem for use by an application; this is “grid storage.”
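To make "federation" concrete, here is a minimal sketch of the idea in Python. It is not how Gluster is implemented; the `FederatedStore` class, its methods, and the use of plain dictionaries as stand-ins for disks are all assumptions made for illustration.

```python
import zlib


class FederatedStore:
    """Presents several independent stores as one key/value interface."""

    def __init__(self, backends):
        # Each backend is a plain dict here; in practice it might be a
        # disk, a server, or a remote filesystem (hypothetical stand-ins).
        self.backends = backends

    def _pick(self, key):
        # A stable hash of the key decides which backend owns it.
        index = zlib.crc32(key.encode("utf-8")) % len(self.backends)
        return self.backends[index]

    def put(self, key, value):
        self._pick(key)[key] = value

    def get(self, key):
        return self._pick(key)[key]


disks = [{}, {}, {}]
store = FederatedStore(disks)
for name in ("a.txt", "b.txt", "c.txt", "d.txt"):
    store.put(name, f"contents of {name}")

print(store.get("c.txt"))          # contents of c.txt
print(sum(len(d) for d in disks))  # 4
```

The caller only ever sees `put` and `get`; which of the three "disks" holds a given file is the grid's business, not the application's.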

The term “smart grids” typically refers to power grids, which is appropriate but limited in scope.

A power grid can be thought of as a network with multiple routes to deliver power to a given address, able to respond to outages without interruption to the consumers, and able to reconfigure based on availability of routes.

In this, it sounds very much like an IP network, where a given packet is forwarded hop by hop based on the connections each router has available. Remove a trunk from the Internet and packets will still arrive at their destination, although perhaps more slowly as they take less optimal routes than previously existed.
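The rerouting behavior can be sketched with a tiny graph search. This is an illustration only: real routers use routing protocols rather than a global breadth-first search, and the node names and the `find_route` function are invented for the example.

```python
from collections import deque


def find_route(links, source, destination):
    """Return the route with the fewest hops, or None if none exists."""
    # Build an undirected adjacency map from the list of links.
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Breadth-first search, so the first route found has the fewest hops.
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == destination:
            return path
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


links = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "E"), ("E", "D")]
print(find_route(links, "A", "D"))      # ['A', 'B', 'D']

# Take the B-D "trunk" out of service: delivery still works, by a longer route.
remaining = [link for link in links if link != ("B", "D")]
print(find_route(remaining, "A", "D"))  # ['A', 'C', 'E', 'D']
```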

However, power grids and networks are infrastructure; we definitely notice when they fail, but we also rarely design non-infrastructure systems that have similar features.

As developers, we’re not used to enduring failures, but we certainly seem to expect our users to tolerate downtime.

We don’t design our applications to be available.

We come up with terms like “highly available” to describe applications that feature the sort of baseline availability we expect from our infrastructure, and this sort of availability is seen as being attainable at high cost.

Is this right? Isn’t this like selling a car, then expecting the consumer to see tires as an optional add-on?

While some applications are certainly less important than others to a given business, shouldn’t a business see its applications as resources that are crucial to its lifeblood?

If Google’s search facility somehow went down, don’t you think their core mission would be violated?

The answer to these, of course, is “yes,” so we, as application architects and developers, need to think first about what “availability” means, then about what level of availability we should aim for, and finally about how to achieve it.

Service Levels

A “service level” is a description of expected performance under given conditions. It’s typically written up in a “Service Level Agreement,” or “SLA,” and well-written SLAs can read like legal documents yet be very useful for determining delivery criteria for an application.

An SLA is more than a simple requirements document.

A requirements document might have something that says “The application should track the physical location of a box.”

A good SLA would be far more detailed. Here is an example of a single service level for an application: “The application should be able to track changes in physical location for 100,000 boxes in the space of two seconds, with a granularity of 200 yards, accepting input from physical barcode scanners connected via TCP/IP or radio, or from an Android-based GPS device.”

Consider what this tells us:

  1. Inputs. The application tracks geographical locations; the channels for these locations are barcode scanners connected over TCP/IP or radio, and Android-based GPS devices.
  2. Something about the outputs. It should be able to show the locations; ordinarily, there would be another SLA governing reporting of the outputs.
  3. The number of inputs to expect concurrently, and what “concurrent” means in real time. It specifies that we should be able to handle 100,000 messages in the space of two seconds.
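One way to keep a service level like this from drifting back into vague prose is to record its numbers as data. The sketch below is an assumption about how that might look; the `ServiceLevel` class and its field names are invented for illustration and are not part of any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    """A single service level, reduced to numbers a test can check."""
    description: str
    max_inputs: int          # inputs to absorb in one window
    window_seconds: float    # what "concurrent" means in real time
    granularity_yards: int
    input_channels: tuple

    def required_throughput(self):
        # Sustained rate implied by the window, in inputs per second.
        return self.max_inputs / self.window_seconds


box_tracking = ServiceLevel(
    description="Track changes in physical location of boxes",
    max_inputs=100_000,
    window_seconds=2.0,
    granularity_yards=200,
    input_channels=("tcp/ip", "radio"),
)

print(box_tracking.required_throughput())  # 50000.0
```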

This isn’t a trivial challenge.

The inputs are very simple for two of the input mechanisms; the radio-based scanners might require custom coding for device drivers. However, the 100,000 position changes over two seconds might be interesting to simulate and test.

The outputs here would be very simple, but again, the time constraint combined with the input data size presents an interesting problem.

The time constraint and data size are important. The requirement doesn’t mean that a single position change has to be handled in 1/50,000th of a second (20 microseconds); if one position change takes exactly two seconds to process, but the system can actually process 100,000 such position changes concurrently, we’ve fulfilled the requirement.
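The difference between per-item latency and overall throughput is easy to demonstrate. In this sketch, `asyncio.sleep` stands in for real work, and the numbers are scaled down (1,000 changes at 0.2 seconds each) so it runs quickly; the function names are made up for the example.

```python
import asyncio
import time


async def handle_position_change(box_id, latency):
    # Stand-in for real work: every change takes `latency` seconds.
    await asyncio.sleep(latency)
    return box_id


async def process_batch(count, latency):
    start = time.monotonic()
    # Start all the handlers at once and wait for every one to finish.
    results = await asyncio.gather(
        *(handle_position_change(i, latency) for i in range(count))
    )
    return len(results), time.monotonic() - start


handled, elapsed = asyncio.run(process_batch(count=1_000, latency=0.2))
print(handled, round(elapsed, 2))
```

Handled one after another, these changes would need 200 seconds. Handled concurrently, the whole batch finishes in roughly the time a single change takes, which is exactly the reading of the SLA described above.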

Service level agreements can be (and should be) very detailed, incorporating requirements about performance (as shown) and physical environments, including network availability and error handling.

They’re like requirements documents that give exact metrics for failure; process only 99,000 inputs in two seconds, and you know your application has not met its requirements for delivery.

For developers, well-written SLAs are also a godsend, because they dictate exactly what needs to be tested for success; with the above SLA, for example, there are multiple individual systems that need to be tested (inputs via TCP or radio, as well as a count of operations to fulfill within a given timespan). Integration tests would simply combine the individual tests, and would give a “smoke test” for success: if the integration test passes, the SLA is met; otherwise it’s not.
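A throughput criterion like this one translates almost directly into a test. The following is a rough sketch rather than a real load-testing harness; `count_processed` and `meets_sla` are hypothetical helpers, and a real test would drive the actual input channels instead of a Python callable.

```python
import time


def count_processed(handler, inputs, window_seconds):
    """Count how many inputs `handler` finishes before the window closes."""
    deadline = time.monotonic() + window_seconds
    done = 0
    for item in inputs:
        if time.monotonic() >= deadline:
            break  # out of time; whatever is left counts as a miss
        handler(item)
        done += 1
    return done


def meets_sla(done, required):
    # The SLA gives an exact metric: anything short of it is a failure.
    return done >= required


print(meets_sla(99_000, 100_000))   # False
print(meets_sla(100_000, 100_000))  # True
```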

Part of the failure of our industry is a lack of awareness of SLAs, along with an unwillingness to require them to be precise.

Concerns for a Smart Application

A “smart application,” using the “smart grid” as a baseline architecture, has a number of characteristics.


Asynchronicity

The first characteristic is that it’s asynchronous.

Synchronous operations are call-and-response; a producer creates a message (the “call”), and waits for a response.

Asynchronous operations, on the other hand, are represented as sets of “fire-and-forget” processes; a producer creates a call and issues it, and from its perspective, the operation is done. A consumer is waiting for the call, and generates a response to it by becoming a producer itself, and issuing its own “fire-and-forget” process.

It’s as if the synchronous call was inverted: each side (both the producer and consumer) sees itself in both roles. Each one repeats the sequence, and each side looks the same.
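Here is a small sketch of that inversion using only Python's standard library. Two in-process queues stand in for a message broker, and the message fields (`id`, `body`, `in_reply_to`) are invented for the example.

```python
import queue
import threading

requests = queue.Queue()  # calls travel this way
replies = queue.Queue()   # responses travel back as new messages


def consumer():
    """Wait for calls; answer each by issuing its own fire-and-forget message."""
    while True:
        message = requests.get()
        if message is None:  # sentinel: stop the worker
            break
        replies.put({"in_reply_to": message["id"],
                     "body": message["body"].upper()})


worker = threading.Thread(target=consumer)
worker.start()

# The producer issues its calls and moves on without waiting for answers.
for i, body in enumerate(["ping", "status"]):
    requests.put({"id": i, "body": body})
requests.put(None)
worker.join()

# Later, the original producer takes the consumer role and reads the replies.
collected = []
while not replies.empty():
    collected.append(replies.get())
print(collected)
```

Neither side ever blocks on the other's answer; each one just reads from one queue and writes to another.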

There are a lot of ways to make this happen, and the technique used depends very much on the technologies in place. In most architectures today, this is done via a message broker, which might use any of a number of technologies; one popular one is AMQP (the “Advanced Message Queuing Protocol”), which has multiple implementations available, including the Fedora AMQP Infrastructure.

However, this is certainly not the only option; other possibilities could be built on Hadoop, or with TCP/IP sockets directly.


Multihoming

Another aspect of highly available systems is that they don’t depend on a single point of failure. Just as multiple routes exist to reach a given IP address, multiple servers exist to handle requests and generate responses.

Therefore, instead of having a single IP address (and server) to broker messages as described in the section on asynchronicity, you’d have groups of multiple servers, preferably in multiple geographic regions.

The application would be configured such that it knew where to find these grouped servers (possibly via a “known good” set of servers that provides links to the others); therefore, if one node should happen to go down, another would be available for use with very little delay.
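A client-side version of this can be sketched in a few lines. Everything here is hypothetical: the node names, the `NodeUnavailable` exception, and the `fake_send` transport exist only to show the shape of the failover loop.

```python
class NodeUnavailable(Exception):
    """Raised when a node cannot be reached (hypothetical error type)."""


def send_with_failover(nodes, message, send):
    """Try each known node in order; return the first successful response."""
    errors = []
    for node in nodes:
        try:
            return node, send(node, message)
        except NodeUnavailable as exc:
            errors.append((node, exc))  # remember the failure, try the next
    raise NodeUnavailable(f"all {len(nodes)} nodes failed: {errors}")


# Simulated transport: one region is down.
DOWN = {"broker-us-east"}


def fake_send(node, message):
    if node in DOWN:
        raise NodeUnavailable(node)
    return f"{node} accepted {message!r}"


nodes = ["broker-us-east", "broker-eu-west", "broker-ap-south"]
used, response = send_with_failover(nodes, "box 42 moved", fake_send)
print(used)      # broker-eu-west
print(response)  # broker-eu-west accepted 'box 42 moved'
```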

This is a feature the SLA should address: what kind of degraded behavior is acceptable during a failure? If the answer is “none,” there are ways to handle this, but the architecture has to be designed such that it integrates whichever responses are available, instead of waiting to collect all responses.
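"Integrating whichever responses are available" might look like the following sketch, which uses `asyncio.wait` with a timeout. The node names and delays are invented, and `asyncio.sleep` stands in for a network round trip.

```python
import asyncio


async def query_node(name, delay):
    # Stand-in for a network call; a slow delay simulates a failing node.
    await asyncio.sleep(delay)
    return name


async def gather_available(delays, timeout):
    tasks = [asyncio.create_task(query_node(name, delay))
             for name, delay in delays.items()]
    # Wait up to `timeout` seconds, then work with whatever has arrived.
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()  # give up on nodes that did not answer in time
    return sorted(task.result() for task in done)


answers = asyncio.run(gather_available(
    {"node-a": 0.01, "node-b": 0.02, "node-c": 5.0}, timeout=0.5))
print(answers)  # ['node-a', 'node-b']
```

The caller gets an answer in half a second at most, built from the nodes that responded, instead of stalling on the one that did not.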

Note that multihoming affects every aspect of the application. Components that would need to be replicated include:

  1. HTTP Servers
  2. Message Brokers
  3. Datastores
  4. Processing nodes (i.e., what coordinates the processes between the HTTP servers, message brokers, and datastores)


Security

Depending on the nature of the application and the network upon which it’s deployed, you must consider security. Any data traveling over the open internet has to be considered exposed; therefore, you have to consider encryption and handshaking to make sure your data isn’t readable by anyone who happens to be watching your traffic.

Outside of noting that security must be addressed and present for real applications, this series is not going to focus on this issue.

Just let it be understood that your application is as insecure as you allow it to be; take precautions.
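As one example of such a precaution, here is a minimal sketch of a TLS client setup with Python's standard `ssl` module. It assumes TLS is the right tool for your transport, which this series does not otherwise establish; the helper names are made up, and `open_secure_channel` is shown but not called because it needs a real server.

```python
import socket
import ssl


def make_client_context():
    # The default context verifies certificates and hostnames.
    context = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def open_secure_channel(host, port, context):
    """Connect to host:port and wrap the socket in TLS."""
    raw = socket.create_connection((host, port), timeout=5)
    try:
        return context.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()  # do not leak the socket if the handshake fails
        raise


context = make_client_context()
print(context.check_hostname)                   # True
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
```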


Platforms

Another thing you should consider is your platform of implementation… or, more likely, the platforms of implementation.

In the modern environment, “platform” has multiple meanings.

It might mean something as simple as an operating system installed on a specific hardware series: Fedora 17 on an HP DL460, for example, replicated across your entire architecture. This is probably not realistic.

It might mean a VM, like a Java VM or .NET/Mono environment, installed on any of a number of operating systems (from an embedded ARM processor, to a Windows client, and then to a DL460 running RHEL).

It might mean a combination of those, along with devices running Android (and thus the Dalvik virtual machine), talking to a client running an application via Mono, connecting to a backend device using a C++ broker providing AMQP, with messages being consumed by applications written with C, C++, C#, and Java.

That last scenario is most realistic, except we obviously left off the iPad and other iOS devices, which are also viable participants. (Adding iOS and .NET primarily has the effect of making the architecture reliant on non-open-source implementations.)

The fact is that few architectures today are homogeneous, and you simply have to compensate for heterogeneous environments. Your protocols will need to be available for each environment and device, and you must be ready to design messages for each.
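In practice, designing messages for a mixed environment usually means picking an encoding every platform can read. The sketch below uses JSON as one such choice; the message fields and the `version` check are assumptions for illustration, not a format prescribed by this series.

```python
import json


def encode_position_change(box_id, latitude, longitude, source):
    """Build a language-neutral message as UTF-8 encoded JSON bytes."""
    message = {
        "type": "position_change",
        "version": 1,  # lets consumers reject formats they don't know
        "box_id": box_id,
        "latitude": latitude,
        "longitude": longitude,
        "source": source,
    }
    return json.dumps(message, sort_keys=True).encode("utf-8")


def decode_message(payload):
    """Parse a message and check that its version is one we understand."""
    message = json.loads(payload.decode("utf-8"))
    if message.get("version") != 1:
        raise ValueError(f"unsupported version: {message.get('version')!r}")
    return message


payload = encode_position_change(42, 35.78, -78.64, "android-gps")
decoded = decode_message(payload)
print(decoded["box_id"], decoded["source"])  # 42 android-gps
```

A consumer written in C, C++, C#, or Java only needs a JSON parser to read the same bytes.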

What We’ve Covered

So far, we’ve discussed mostly architectural concerns, along with a single artifact that will govern the rest of the development and delivery of your application.

The artifact, the SLA, is a deliverable that determines how, in specific terms, the application will fulfill its requirements. It is not the same as a requirements document, although it is similar; it tends to focus on measurable technical criteria rather than implementation details.

Depending on the nature of the service levels, then, you should consider asynchronicity, multihomed servers, security, and your choice of implementation platforms as tools available to make your job easier; in the end, being aware of these aspects actually makes your architecture much stronger, and your implementation simpler.