A few thoughts for a Monday…

Life has changed a lot lately.

First, I changed employers; I’ve left Red Hat (for various reasons) and am now working for a company that will give me a little more direct purpose, along with an imperative for using Scala. (Did you see what I did there? Scala’s typically a functional language, not an imperative one, so… um… a pun’s still funny if you have to explain it, right?)

Second, my wife and I both use the Microsoft Surface 3, in addition to our regular working computers. This means that unless you count virtual servers, I’m using more instances of Windows than anything else. (If you factor in remote hosts, Linux is still ahead.)

It’s an interesting set of changes. The job is going to be a lot of fun – a lot of work, but it’s a good fit for me, I think. Red Hat’s still a great company; if I’d not changed jobs, I’d still have been pleased to work for them.

However, this is a good change, and Scala is something I’ve enjoyed as a very casual observer; this gives me a chance to really roll up my sleeves and approach the language with a real intent and purpose.

Functional programming’s going to take some getting used to, though. I’m reading an excellent book, “Functional Programming in Scala,” but deep into the book – I cheated and read ahead – it says “Cause side-effects but shuffle them around so their location is more tolerable.” (Not a quote, even though it’s in quotes.) I’m not sure that’s a terrible idea – at some point you do have to have an actual effect – but the way it’s offered is odd. It introduces an inconsistency in the FP paradigm, and then handwaves it away.

The Surface is a wonderful device. Since it’s running the same platform as my other Windows devices (as opposed to the Surface 2’s Windows RT) I can migrate seamlessly between computers; the form factor is different, but little else is.

The form factor, however, is quite a change. The lack of an Insert key is easily the worst thing about the Surface – a lot of programs use it, and it was left off the physical keyboard. (The virtual keyboard has it, but eats up a lot of screen space. I’ll use it on occasion, but I definitely prefer the physical keyboard, which is part of the cover.)

Apart from the Insert key, right-clicking relies on how long you hold the contact, which takes some getting used to; an alternative is to use a Bluetooth (or USB) mouse. I’m still adjusting to that aspect of the Surface, but it’s less of a problem than the missing physical Insert key.

With all of that said, the Surface’s ability to serve as a tablet is fantastic. It’s my new book reader, hands down; the larger screen (compared to the Nook HD+), the more solid feel, and the much higher resolution make it excellent for reading.

However, I’m moving away from Amazon. Most of my library’s in the Kindle format, and I suppose that can’t entirely change (what with prior investment), but honestly, the Nook reader application on Windows is far better: with the Amazon application for Windows I can’t read my own content, and with Nook I can. I’m an author; I get my own books in EPUB format, and I purchase a lot of books in EPUB. With the Kindle for PC app I can’t read them on the Surface; I have to use the Nook HD+ or my phone.

It’s not good that I can’t use my Surface for reading my own content. It’s not the Surface’s fault – it’s Amazon’s. If and when Amazon fixes it, I’ll be willing to use their reader again (and I’d prefer to, because that’s where most of my library is), but until then, Nook for Windows wins by default. It’s just as readable, and the fact that I can import my own content makes it a huge win.

Adding arguments to CentOS boot

I’m running CentOS in a VirtualBox image (so I can play with Docker while using Windows, because… um… because). I don’t want X running; I’m happiest with the command line for this. That makes the console resolution important to me: I don’t want an 80-by-25 console, I want something better.

It’s actually really easy, once you know how – that’s why I’m writing it up, so someone else can see exactly how I did it.

First, I needed to determine what values to use. You can work out potential values using hwinfo, which isn’t part of the CentOS repositories – you’ll have to install it from GhettoForge.
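For reference, here’s a minimal sketch of that step. It assumes the GhettoForge release RPM is already installed, and the repository name (gf) is my assumption – check your setup:

# enable the GhettoForge repo for this install and pull in hwinfo
sudo yum --enablerepo=gf install hwinfo
# list the VESA framebuffer modes the hardware reports
sudo hwinfo --framebuffer | less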

To be fair, I knew what values I wanted – I wanted mode 792 (0x318, which is 1024×768 with 24-bit color).

(You should figure out what mode you want, then try it on your kernel before setting up the boot parameters. You have been warned.)

All I needed to do to set the boot param was:

sudo grubby --args=vga=792 --update-kernel=ALL

With that, when I booted CentOS, I got a larger console (but not a giant one), perfectly suited to what I wanted.
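If you want to confirm what grubby actually wrote, listing the kernel entries shows the arguments for each one:

sudo grubby --info=ALL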

Smart Grids series reposted

I just republished an article I’d worked on for Red Hat, called “Smart Grids.” It’s got twelve parts already, with two more not actually finalized; I’d been pulled off the smart grids article and never quite got back around to it. I’d still like to, though. It’s a good walkthrough of an “Internet of Things” architecture.

Check it out!

Repost: Caches, an unpopular opinion, explained

I have an unpopular opinion: caches indicate a general failure, when used for live data. This statement tends to ruffle feathers, because caches are very common, accepted as a standard salve for a very common problem; I’ve been told that cloud-based vendors say that caches are the key to scaling well in the cloud.

They’re wrong and they’re right. In fact, they’re mostly right – as long as some crucial conditions are fulfilled. We’ll get to those conditions, but first we need to get some details clear.

Caches are fine for many situations.

In data, it’s important to think in terms of read/write ratios. Broadly, you have read-only data (like, oh, a list of the US states), read-mostly data (user profiles), read/write data (live order information), and write-mostly or write-only data (audit transactions or event logging). Obviously someone might read the audit logs some day, so the definitions aren’t strict; but within the scope of a given application’s normal use, it can be appropriate to treat audit logs as write-only, because the application might never read them itself.

The read/write status is crucial for determining whether a cache is appropriate. It’s entirely appropriate to cache read-only data, given resources.

In addition, temporary data that’s reused can be cached. Think of, oh, a rendered stylesheet or a blog post: it changes rarely (unless it’s written by me, in which case it gets constantly edited), is requested often (unless it’s written by me, in which case it has one reader), and the render phase is slow compared to the delivery of the content (because Markdown, while very fast, isn’t as fast as retrieving already-rendered content).
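That kind of rarely-changing content is the easy case for caching. Here’s a minimal Java sketch of the idea; renderMarkdown is a placeholder for whatever the real (comparatively slow) render step would be:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class RenderCache {
    private final ConcurrentMap<String, String> rendered = new ConcurrentHashMap<>();

    // Render once, serve the cached HTML afterwards.
    public String htmlFor(String postId, String markdown) {
        return rendered.computeIfAbsent(postId, id -> renderMarkdown(markdown));
    }

    // Called when the post is edited, so the next read re-renders it.
    public void invalidate(String postId) {
        rendered.remove(postId);
    }

    private String renderMarkdown(String markdown) {
        // Placeholder for the real Markdown renderer.
        return "<p>" + markdown + "</p>";
    }
}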

Caches for Live Data

The use of cache on live data, where the data is read-mostly to write-only, is what I find distasteful. There are circumstances which justify their use, as I’ve already said, but in general the existence of a cache indicates an opportunity for improvement.

In my opinion, you should plan on spending zero lines of code on caching of live data. That said, let’s point out when that’s not true or possible.

In the typical architecture for caches, you have an application that, on read, checks the cache (a “side cache”) for a given key; if the key exists in the cache, the cached data is used. If not, the database is read, and the cache is updated with the value for that key (for future use). When writes occur, the cache is updated at the same time the database is, so that for a given cache node the database and the cache stay synchronized.
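In code, that “side cache” pattern usually looks something like this sketch; the Database interface is a stand-in for the real database, not any particular product’s API:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SideCachedRepository {
    // Stand-in for the real database (a JDBC DAO, a REST client, whatever).
    public interface Database {
        String load(String key);
        void save(String key, String value);
    }

    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>(); // the "side cache"
    private final Database database;

    public SideCachedRepository(Database database) {
        this.database = database;
    }

    public String read(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;                  // cache hit: the database is never touched
        }
        String loaded = database.load(key); // cache miss: fall back to the database
        if (loaded != null) {
            cache.put(key, loaded);         // remember it for future reads
        }
        return loaded;
    }

    public void write(String key, String value) {
        database.save(key, value); // update the database...
        cache.put(key, value);     // ...and this node's copy; other nodes' caches are now stale
    }
}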

The database here is the “System of Record,” the authoritative data source for a given data element. The cache holds a copy of the data element, and shouldn’t be considered “correct” unless it agrees with the system of record.

You can probably see a quick issue (and one that’s addressed by many caching systems): distribution. If you have many client systems, you have many caches, and therefore many copies, each considered accurate as long as they agree with the system of record. If one system updates the system of record, the other cached copies are now wrong until they are synchronized.

Depending on the nature of the data, maintaining accurate copies could require polling the database even before the cached copy is used. Cache performance in this situation gets flushed down the tubes, because the only time a cache provides a real enhancement is when the data element is large enough that transferring it to the application takes much longer than it should. (A data item that’s 70 KB, for example, is probably going to take a few milliseconds to transfer – more than checking a timestamp would – so you’d still see a benefit even while checking timestamps.)

Some caching systems (most of them, really) provide a distributed cache, so that a write to one cache node is reflected in the other caches, too. This evades the whole “out of sync” issue but introduces some other concerns; still, it’s something you should have before even considering caching.

If you’re going to use a cache, it should be distributed. You should look for a peer-to-peer distribution model, and transaction support; if your cache doesn’t have these two features, you should look for another cache. (Or you could just use GigaSpaces XAP, which does this and more; read on for an explanation.)

So what should you really do?

To me, the problem lies in the determination of the System of Record. A cached value is a copy; I don’t think it’s normally necessary to have that copy, and it’s actually fairly dangerous. So what’s the alternative?

Why not use a System of Record that’s as fast as a cache? If the data happens to be cached, you don’t care beyond reaping the benefits; your data architecture gets much simpler (no more code for caches). Your application speeds up (dramatically, actually, because data access time is in microseconds rather than milliseconds… or seconds). Your transactions collide less because the transaction times go down so much. Everybody wins.

The term for this kind of thing is “data grid.” Most data grids are termed “in-memory data grids,” which sounds scary for a System of Record, but there are easy ways to add reliability. Let’s work out how that happens for GigaSpaces, because that’s who I work for.

In a typical environment, you’d have a group of nodes participating in a given cluster. These nodes have one of three roles: primary, backup, or mirror (with the mirror being a single node). A backup node is assigned to one and only one primary; a primary can have multiple backups. The primaries are peers (using a mesh-style network topology, communicating directly with each other).

Let’s talk about reads first, because writes will need more examination than reads do. A client application has a direct connection to the primaries; depending on the nature of the queries, requests for a given data element are either distributed across all primaries (much like a map/reduce operation) or routed directly to a calculated primary (i.e., the client knows where a given piece of data should be and doesn’t spam the network).
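The “calculated primary” part is just routing by key; here’s a toy sketch of the idea (not GigaSpaces’ actual routing code):

public final class PartitionRouter {
    private final int partitionCount;

    public PartitionRouter(int partitionCount) {
        this.partitionCount = partitionCount;
    }

    // Every client computes the same partition for the same routing key,
    // so a read by key goes straight to one primary instead of querying all of them.
    public int partitionFor(Object routingKey) {
        return Math.floorMod(routingKey.hashCode(), partitionCount);
    }
}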

Now, before we jump into writes, let’s consider my originating premise, stated too simply: caches are a bad sign, because they exist to compensate for slow data retrieval. In this situation the reads themselves are very, very fast, because you’re talking to an in-memory data grid, not a filesystem at all – but you still have to factor in network travel time, right?

No, not always. XAP is an application platform, not just a data grid. The cluster nodes we’ve been talking about can hold your application, not just your application’s data – you can co-locate your business logic (and presentation logic) right alongside your data. If you partition your data carefully, you might not talk to the network at all in the process of a transaction.

Co-located data and business logic comprise the ideal architecture; in this case, you have in-memory access speeds (just like a cache) with far more powerful query capabilities. And with that, we jump to data updates, because that’s the next logical consideration: what happens when we update our data grid that’s as fast as a cache? It’s in memory, so it’s dangerous, right?

No. Being in-memory doesn’t imply unreliability, because of the primary/backup/mirror roles, and synchronization between them. When an update is made to data in a primary, it immediately copies the updated data (and only the updated data) to its backups. This is normally a synchronous operation (so the write operation doesn’t conclude until the primary has replicated the update to its backups).
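As a toy illustration of what “synchronous” means there – this is just the shape of the guarantee, nothing like the real implementation – the write call doesn’t return until every backup has acknowledged the change:

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PrimaryNode {
    // Stand-in for a backup replica reachable over the network.
    public interface Backup {
        void replicate(String key, String value); // blocks until the backup has applied the change
    }

    private final ConcurrentMap<String, String> data = new ConcurrentHashMap<>();
    private final List<Backup> backups;

    public PrimaryNode(List<Backup> backups) {
        this.backups = backups;
    }

    public void write(String key, String value) {
        data.put(key, value);             // apply locally on the primary
        for (Backup backup : backups) {
            backup.replicate(key, value); // synchronous: wait for each backup to acknowledge
        }
        // Only now does the caller's write operation complete.
    }
}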

If a primary should crash (because someone unplugged a server, perhaps, or perhaps someone cut a network cable), a backup is immediately promoted, a new backup is allocated as a replacement for the promoted backup, and the process resumes.

The mirror is a sink; updates are replicated to it, too, but asynchronously. (If the mirror has an issue, mirror updates will queue up so when the mirror resumes function all of the mirror updates occur in sequence.)

In this configuration – again, this is typical – the database becomes fairly static storage. No application writes to it directly (updates reach it only through the mirror), because it’s secondary; the actual system of record is the data grid. The secondary storage is appropriate for read-only operations – reports, for example.

Does this mean that the data grid is a “cache with primacy”? No. The data grid in this configuration is not a cache; it’s where the data “lives,” the database is a copy of the data grid and not vice versa.

Does it mean we have to use a special API to get to our data? No. We understand that different data access APIs have different strengths, and different users have different requirements. As a result, we have many APIs to reach your data: a native API, multiple key/value APIs (a map abstraction as well as memcached), the Java Persistence API, JDBC, and more.

Does it mean we still might want a cache for read-only data? Again, no. The data grid is able to provide cache features for you; we actually refer to the above configuration as an “inline cache,” despite its role as a system of record (which makes it not a cache at all, in my opinion). But should you need to hold temporary or rendered data, it’s entirely appropriate to use the data grid as a pure cache, because you can tell the data grid what should and should not be mirrored or backed up.

The caching scenarios documentation on our wiki actually points out cases where a data grid can help you even if it’s not the system of record, too. For example, we’re a powerful side cache when you don’t have enough physical resources (in terms of RAM, for example) to hold your entire active data set.

Side Cache on Steroids

One of the things that I don’t like about side caches is that they typically infect your code. You know the drill: you write a service that queries the cache for a key, and if it isn’t found, you run to the database and load the cache.

This drives me crazy. It’s ugly, unreliable, and forces the cache in your face when it’s not really supposed to be there.

What XAP does is really neat: it can hold a subset of data in the data grid (just like a standard cache would), and it can also load data from secondary storage on demand for you with the external data source API. This means you get all the benefits of a side cache, with none of the code; you write your code as if XAP were the system of record even if it’s not.
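To make the “none of the code” claim concrete, a read-through store has roughly the shape below: the only thing you supply is a loader for misses, and the calling code never mentions the cache at all. (The interface here is a generic illustration, not XAP’s actual external data source API.)

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ReadThroughStore {
    // The one piece you supply: how to fetch a missing entry from secondary storage.
    public interface Loader {
        String load(String key);
    }

    private final ConcurrentMap<String, String> grid = new ConcurrentHashMap<>();
    private final Loader loader;

    public ReadThroughStore(Loader loader) {
        this.loader = loader;
    }

    // Application code just calls get(); the miss-and-load logic lives here, not in the caller.
    public String get(String key) {
        return grid.computeIfAbsent(key, loader::load);
    }
}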

So do I REALLY hate caches?

If the question is whether I truly dislike caches or not, the answer would have to be a qualified “no.” Caches really are acceptable in a wide range of circumstances. What I don’t like is the use of caches in a way that exposes their weaknesses and flaws to programmers, forcing the programmers to understand the nuances of the tools they use.

XAP provides features out of the box that protect programmers from having to compensate for the cache. As a result, it provides an ideal environment in which caches are available in multiple ways, configurable by the user to fit specific requirements and service levels.

A final note

You know, I say things like “caches are evil,” because it attracts attention and helps drive discussion in which I can qualify the statement. As I said above, I don’t actually think that – and there are lots of situations in which people have to adapt to local conditions regardless of the “right way” to do things.

Plus, being honest, there’s more than one way to skin a cat, so to speak. My “right way” is very strong, but your “right way” works, too.

And that’s the bottom line, really. Does it work? If yes, regardless of whether the solution is a bloody hack or not, then the solution is right by definition. Pragmatism wins. What I’ve been discussing is a shorter path to a solution that people run into over and over again, when they really shouldn’t. I think it’s the shortest path. (I’ve found no shorter, and I have looked.) It’s not the only path.

Repost: mea culpa: “offheap access is slow”

Steve Harris has been commenting on DZone about my last post, “BigMemory: Heap Envy.” One of his comments linked to a blog post of his, “Direct Buffer Access Is Slow, Really?,” in which he says that direct access is not slow, and that therefore one of my points was invalid.

Well, folks, he’s right, for all intents and purposes; it doesn’t change the conclusions about BigMemory (it’s still for people who aren’t willing to, you know, tell the JVM how it’s supposed to manage memory), but direct access is not as slow as I first supposed.

I believed their own documentation, and my memory exaggerated the effect of the context changes.

See, the Terracotta documentation of BigMemory, in the Storage Hierarchy section, has this quote:

OffHeapStore – fast (one order of magnitude slower than MemoryStore) storage of Serialized objects off heap.

Further, in the introduction of BigMemory (again, on Ehcache.org), you find this:

Serialization and deserialization take place on putting and getting from the store. This means that the off-heap store is slower in an absolute sense (around 10 times slower than the MemoryStore)

Like an idiot, I took “one order of magnitude” and “10 times slower than the MemoryStore,” and thought “ouch.” Looking at Steve’s measurements and my own, it’s slower – but definitely not by much, and your allocation patterns definitely affect the speed.

Therefore, I don’t think it’s fair to point out the direct buffer access as a problem.

That said, I haven’t seen anything that mitigates the cost of serialization, which was the primary point I was trying to make; offheap access time wasn’t crucial.

So I’m glad Steve has been posting about this; it’s challenged the assumptions I gathered from reading their pages on BigMemory.

It hasn’t changed my initial analysis: it’s an idea that others have used, and discarded, because it hasn’t proven necessary.

Why don’t I value BigMemory? Because I ran their test and was able to get better response times and latency with the same data set sizes and JVM heap sizes, with the basic ConcurrentHashMap.

If you have requirements that specify Ehcache with giant cache sizes, and you can’t tune your JVM due to lack of access or knowledge or whatever, BigMemory might help you some. (I can’t help you if you can’t access your JVM startup parameters, but as for the knowledge thing – there’s this intarweb thing I’ve heard about…)

I still think a side cache is a signal that your data access mechanism is too slow for your uses, and you should try to fix that rather than adding extra parts.

Author’s Note: Reposted as clarification for Repost: BigMemory: Heap Envy.

Repost: BigMemory: Heap Envy

Terracotta has announced the availability of BigMemory, which provides a large offheap cache through their Ehcache project. It is designed to avoid the GC impact caused by massive heaps in Java, at a license cost of $500 per GB per year, if I have my figures right.

The Reason We’re Here

First, let’s understand the reason BigMemory exists at all: the nature of the heap in the standard (“Sun”) JVM memory model.

In very basic summary, there are two generations of objects in the Sun JVM: a young generation and an old generation. Simply put, garbage collection occurs when a generation fills up.

A young generation’s garbage collection is normally very quick; the young generation space isn’t usually very large. However, the old generation’s garbage collection phase can be very long, depending on a lot of factors – the simplest of which is the size of the old generation. (The larger the old generation is, the longer garbage collection will take… more or less.)

The problem addressed by BigMemory is the existence of a large old generation (let’s say, larger than ten gigabytes). When you have a lot of read-only data (as in a side cache like Ehcache), and an old-generation GC occurs, the garbage collector potentially has to walk through a lot of data that isn’t eligible for collection; that takes time, and it slows the application down. Some operations cause it to block the application.

Let’s all agree: blocking the application is usually unpleasant.

Is BigMemory a Good Thing?

If your concern is garbage collection pain on large heaps, BigMemory looks like a huge win! No more GC associated with giant cache regions!

That’s not all there is to the situation. Let’s think about this.

You’re talking about optimization of a cache, a cache built around key/value pairs – a map with benefits, more or less. (Namely, expiration of data, and a few other things that we’ll discuss soon.)

Cache isn’t usually the main problem.

Cache is usually not the primary problem for an application; cache is a way of hiding how expensive data access has become, in most cases. It’s a symptom of your database being too slow, for whatever reason, and by whatever definition. (I’m ignoring memoization and computational results, which usually don’t end up being gigabytes’ worth of data anyway.)

Cache moves the problem around: in BigMemory’s case, the problem has gone from “slow data access” to “lots of data causes our app to slow down unacceptably.” There are still costs: serialization still takes time, key management is a problem, and merely accessing the data can be slow.

Key Management

Key management is a factor because you still have to know how to access your data. If you know the object’s primary key, of course, that’s an easy reference to use, but what if you don’t know the object’s primary key? (The object’s lost, that’s what.) A side cache is ideal for working with data that your application is well aware of, but not for data that your application has to derive.

Consider memcached; there are good books on memcached that actually suggest using a SQL query for a cache key! The key, in some cases, is larger than the data item generated by the query. Ehcache isn’t going to have a different solution.

Serialization and Offheap Access

Serialization factors in because of the way BigMemory works.

BigMemory allocates memory directly from the OS, through an NIO direct ByteBuffer, and manages references itself. That part is good, but there are two issues here: serialization and access time.

Serialization factors in whenever the OS’s memory is used: since the memory region is just a set of bytes, Java has to serialize the cached objects into that region on demand. Serialization is not the fastest mechanism Java has – looking at the timings of remote calls in Java, serialization is more expensive than the network transfer itself. Your offheap writes and reads are going to be slow, period, just through serialization.
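To make that cost concrete, here’s roughly what putting an object off-heap involves – serialize it to bytes, copy the bytes into a direct buffer, and do the reverse on every read. (This is an illustration of the mechanism, not BigMemory’s code.)

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;

public class OffHeapWrite {
    public static ByteBuffer storeOffHeap(Serializable value) throws IOException {
        // 1. Serialize: this part stays expensive no matter how fast the buffer is.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        byte[] serialized = bytes.toByteArray();

        // 2. Copy into memory outside the Java heap; the GC never walks this region.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(serialized.length);
        offHeap.put(serialized);
        offHeap.flip();
        return offHeap; // a read does the reverse: copy out, then deserialize
    }
}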

Plus, accessing the byte buffer itself is slow (because accessing offheap memory is slow). This is important, but less so than it could be, because BigMemory identifies a “hot set” of data – they say 10% of the stored data – and keeps it on-heap for rapid access (which presumably avoids serialization, too).

This is a good thing, sort of – but it also has an implication for your heap size. I don’t know offhand how hard a limit that hot-set percentage is, but if it’s not adjustable, your heap will always have to be able to hold the hot set – let’s say 10% of the size of the offheap storage as a rough estimate.

This establishes a limit on the offheap size, too, because a JVM heap that’s too large (to satisfy the needs of the BigMemory offheap cache) would suffer the very same problems BigMemory is trying to help you avoid.

What about the GC time itself?

BigMemory doesn’t actually get rid of GC pauses. It only removes the need to garbage-collect the offheap data (again, roughly 90% of the data lives offheap, as a “cool set” compared to the 10% “hot set.”) Even Terracotta’s documentation shows GC pauses, although they haven’t demonstrated the tuning associated with the pauses.

Actual Impact of BigMemory

If your application is sensitive to any pauses at all, BigMemory’s going to… help a little, but not that much, because pauses will still exist. They’ll have less impact, and your SLA factors in very heavily here, but they’ll still be there, caused by key management at the very least.

The only way to fix those pauses is to fix the application, really. JVM tuning can help, but realistically, an application that needs a giant cache like that has been built the wrong way; you’re far, far, FAR better off localizing smaller slices of your data into separate JVMs, which can communicate via IPC, than you are by pretending a giant heap will make your troubles go away.
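For what “JVM tuning” means in practice – sizing the generations and picking a collector for the workload – a starting point on the Sun JVM of that era might look like the flags below. The sizes and the jar name are placeholders, not recommendations:

java -Xms8g -Xmx8g -Xmn2g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
     -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log \
     -jar app.jar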

So should you buy BigMemory? – A comparison with other mechanisms.

Well, I’d say no, with all due respect to our friends at Terracotta. I actually took their tests and ran comparisons against the JVM’s humble ConcurrentHashMap, and found some really interesting numbers:

ConcurrentHashMap won, by a lot, even in a test calculated to abuse the cache, and we’re still factoring in garbage collection.

The problem statement looked something like this: can we run a large, simple key/value store, consisting of a large amount of data, and avoid garbage collection pauses of over a second? (Remember, this is the statement they’re using to justify the development of BigMemory in the first place.)

The answer is: yes, although the solution takes different shapes depending on what the sensitive aspects of the application are.

For example, on our test system, with a 90 GB heap, fifty threads, and 50% read/write operations, we had an average latency of 184 µsec, with some outliers. The outliers cause our hypothesis to fail, however. (This addresses the actual performance of the cache, though: 50 threads accessing a single map, which… isn’t very kind.)

Further, the most important factor in accessing our map isn’t the size of the map or the GC involved, but the number of threads in the JVM. If we use the same 90 GB heap and giant ConcurrentHashMap with twenty-five threads, our normal latency drops to around 80 µsec, again with outliers in the Sun JVM.

And total throughput? I’m sorry, but BigMemory fails here, too. A BigMemory proponent mentions 200,000 transactions per second on a 100 GB heap. We were looking at millions with ConcurrentHashMap – even with the latency impact of thread synchronization. With a more normal cache usage scenario – closer to a 90% read/10% write split – the numbers for the stock JVM collections climb even more.
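For what it’s worth, the shape of that comparison is easy to reproduce. Here’s a stripped-down sketch of the kind of harness I mean – not Terracotta’s actual test, and with deliberately small placeholder sizes:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class MapStress {
    public static void main(String[] args) throws InterruptedException {
        final int threads = 50;          // vary this: it matters more than the map's size
        final int keySpace = 1_000_000;  // placeholder; grow this (and -Xmx) for a real run
        final ConcurrentMap<Integer, byte[]> map = new ConcurrentHashMap<>(keySpace);
        final LongAdder ops = new LongAdder();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                ThreadLocalRandom random = ThreadLocalRandom.current();
                while (!Thread.currentThread().isInterrupted()) {
                    int key = random.nextInt(keySpace);
                    if (random.nextBoolean()) {
                        map.put(key, new byte[1024]); // 50% writes
                    } else {
                        map.get(key);                 // 50% reads
                    }
                    ops.increment();
                }
            });
        }

        TimeUnit.SECONDS.sleep(60);
        pool.shutdownNow();
        System.out.printf("throughput: %d ops/sec%n", ops.sum() / 60);
    }
}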

Back to latency: if we do something drastic – like, oh, use a higher-performance JVM like JRockit, even without deterministic GC – then we get similar latency numbers, except that the time spent in GC shrinks to something like the 400 msec range. Plus, our hypothesis passes muster, literally – giant heap, no GC pauses of over a second, no investment in anything (besides JRockit itself, of course, which Oracle suggests will be merged into OpenJDK).

Now, here’s the summary: We took Terracotta’s test, factored out BigMemory, and got the same results or better, without rearchitecting anything at all.

If you rearchitect to distribute the “application” properly, you could get even better access times, and even larger data sets, with less discernible GC impact.

That’s power – and, sadly, it doesn’t make BigMemory look like a big splash at all.

Author’s Note: Repost.

Repost: It’s all about the boundaries, baby

The key to distributed – and enterprise – computing is boundary management. Even worse, it’s not conceptual boundary management – it’s real boundary management, which is a lot harder to figure out.

A boundary is a barrier between things or places, right? Well, consider a web browser and a web server; the TCP/IP layer is a boundary.

The application has a boundary between it and its datastore.

You can play games with definitions here all day, and you’d be right: an object/relational mapper (an ORM) is a way to get around the boundary between your object-oriented language and your tuple-oriented database, for example.

The goal of a boundary, here, is to be as invisible as it can be. ORMs fail because they don’t work invisibly; they affect your data model fairly severely, they don’t leverage the strength of the underlying data store very well (although I suppose they’ll stay out of your way as you do an end-run around them); they affect your code pretty severely as well.

JMS fails, sort of, because it provides an excellent boundary between the originator of a request and the consumer of that request – but the implementations include that pesky network all the time. You can eliminate the network, but then you’re not really distributing.

SOAP… well, it’s just six pails of fail. REST is better because it’s so much lighter, but it never hides the network layer from anything.

A distributed application should be able to cross network boundaries without making that boundary management stick out.

You should be able to submit a task or data item or event to a system and have that system cross boundaries if necessary, or localize processing if it can. It shouldn’t consume network bandwidth just because you said “this can run somewhere else.” It should use the network only if the local node is inappropriate for handling a given request.

Data should work this way, too, and here’s where I think most distributed computing platforms fail hard: they provide fantastic capabilities to share data (Terracotta DSO, Coherence, and Hadoop come to mind) or the ability to distribute processing (JMS and GridGain come to mind), but these have limits.

When you can share data but not processing, you have localization and sharding concerns (some of which are supposed to be manageable by the platform. DSO is supposed to be able to do this, for example, although I couldn’t find any references quickly that discussed how.)

When you can share out a process but not the data, you end up creating a choke point around accessing your data, or you have to carry all your data along with you. (Hello, JMS and friends!)

The key here is to keep your eye on the boundaries between your architectural components – and if those boundaries slow you down, well, that’s where your potential for optimization lies.

It’s impossible to tell you generically whether the optimization is sharding a database for local access to critical data, or task partitioning such that map/reduce is optimized for the mapping phase, or any of a number of other potential optimizations. The distribution platform you’ve chosen will massively impact what your options are here.

Don’t forget that key, though: if you manage the boundaries between components well, your application is going to win.

ObEmployerReference: I work for GigaSpaces, and IMO GigaSpaces XAP manages the boundaries impressively well, automating most of it out of your hands. If you’d like to see how that happens, feel free to point out a problem space and maybe we can show you how XAP can help with it.

Author’s Note: Repost, clearly having been written while I worked for GigaSpaces. Some concepts are out of date, because some of the ideas caught on; this is preserved for posterity.

Repost: Adding high availability in Terracotta DSO

Terracotta DSO is a package for distributing references in a heap across virtual machines. (Thus: Java. I thought it included C#, but Geert Bevin reminded me that I’m an idiot.) That means that if you have a Map, for example, you can set it to be shared, and your application can share it with other VMs without having to be coded for that purpose.

The only real requirement on the part of your code is that it be written correctly. As in, really correctly. (Want help with this? Check out Using Terracotta for Configuration Management, an article I helped write for TheServerSide.com.)

Luckily, DSO helps you do that by pointing out failures when you screw up.

Anyway, the way DSO works topologically is through a hub-and-spoke architecture. That means your VMs are clients, tied to a central server. The hub, then, might seem like a single point of failure: if your hub goes down, your clients stop working.

That would be bad. Luckily, DSO has availability options to change the nature of what the “hub” looks like.

The DSO client uses a configuration file named, by default, “tc-config.xml.” It has a reference to a server instance that looks something like this:

<servers>
  <server host="localhost">
    <data>%(user.home)/terracotta/server-data</data>
    <logs>%(user.home)/terracotta/server-logs</logs>
  </server>
</servers>

Note the server hostname: it’s important, surprisingly enough. Adding high availability to DSO is only a matter of changing this block in your configuration file.

What we’re configuring here is active/passive failover. That means there’s a primary server instance and a passive server instance. The two server instances sync up, so if the primary goes down, the passive has all of its data; the clients are perfectly able to switch from using one server to the other as their active status changes, so from the client perspective, if the primary server dies, nothing happens.

When the former primary instance returns, well, it’s not the primary any more; it becomes the passive server instance. From the client perspective, all of this is under the covers.

This is Very Good.

So: here’s a configuration I used (tested with Vista running one DSO server at IP 192.168.1.106, and the other DSO server running under a Windows XP VMware image at IP 192.168.1.132):

<servers>
  <server host="192.168.1.106" name="m1">
    <data>%(user.home)/terracotta1/server-data</data>
    <logs>%(user.home)/terracotta1/server-logs</logs>
    <dso>
      <persistence>
        <mode>permanent-store</mode>
      </persistence>
    </dso>
  </server>
  <server host="192.168.1.132" name="m2">
    <data>%(user.home)/terracotta2/server-data</data>
    <logs>%(user.home)/terracotta2/server-logs</logs>
    <dso>
      <persistence>
        <mode>permanent-store</mode>
      </persistence>
    </dso>
  </server>
</servers>

When you start the server, you use a command line argument, like this, with the configuration file in $TERRACOTTA/bin:

./start-tc-server.sh -n m1

This tells the server instance to use the configuration block named “m1” – yes, I know, but I didn’t have hostnames set up and I wasn’t sure how NAT would affect what I was doing – and the two instances will work out between them which is active and which is passive.
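On the second machine, the same script selects the other server block by name:

./start-tc-server.sh -n m2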

The end result is that I was able to run my shared map program while changing the servers around – starting and stopping them in-process, for example – without the client program knowing or caring. (Note that the client program needs the same basic configuration data; it can get this by having its own copy of the configuration file, which is what I did, or it can load it from the servers, too, or any number of variants. It just needs the relevant data. Happy now, Geert?)
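For context, the client side stays plain Java; here’s a minimal sketch of the kind of shared-map program I mean. The class and field names are mine, and the sharing itself is declared in tc-config.xml (the map field becomes a DSO root), not in the code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SharedMapDemo {
    // Declared as a DSO root in tc-config.xml; DSO then keeps it in sync across VMs.
    private static final Map<String, String> shared = new ConcurrentHashMap<String, String>();

    public static void main(String[] args) throws InterruptedException {
        String me = args.length > 0 ? args[0] : "node";
        int beat = 0;
        while (true) {
            shared.put(me, "heartbeat-" + beat++);      // visible to every VM sharing this root
            System.out.println(me + " sees " + shared); // includes entries written by other VMs
            Thread.sleep(2000);
        }
    }
}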

A side benefit of the configuration is that the DSO cache is persistent – you don’t need high availability for persistence, but you do need persistence for high availability.

It’s good stuff.
