Sudoku and Enterprise Architecture

Roger Sessions makes a point of comparing solving Sudoku puzzles to doing Enterprise Architecture. The comparison isn’t that far-fetched, actually! For example, there are many dimensions of constraints, and at any point you shouldn’t just guess what to do, because if you make the wrong guess, you won’t discover it until you’ve been down a long, horrid path of wasted time.


Architecture Astronauts and Web 2.0

Joel Spolsky is so very annoyed by the Web 2.0 term:

The term Web 2.0 particularly bugs me. It’s not a real concept. It has
no meaning. It’s a big, vague, nebulous cloud of pure architectural
nothingness.

I wonder if the problem comes when we attribute something "architectural" to it. I don’t think it is architectural. As some kind of "visionary fluffy concept", it can certainly have some meaning. I have no problem with that; such concepts can be OK for what they are. But the architecture astronauts will most certainly use it for their purposes; I guess that’s when we’ll all be annoyed.

Developer and Vendor Oriented Languages

DevHawk (Harry Pierson) has written an interesting post about "vendor oriented" versus "developer oriented" languages. I agree with his remark that

Projects don’t fail because developers can’t change the language’s concept of
inheritance. They fail because the gap between the abstractions provided by the
language and the abstractions needed by the solution are enormous. Modern
software development is like building skyscrapers with Lego blocks.

Certainly, in order to build the abstractions to bridge that abstraction gap, all the proper lower level building blocks and abstractions need to be in place (that’s an argument for adding "int?" to C#, for example). But those lower level building blocks do not need to be modifiable! Changing the concept of inheritance, huh? Sure, that could be nice at a research institution, but not in industrial software.

AJAX and GUI Responsiveness

I’ve always thought that web GUI applications are terribly stone-ageish compared to their rich-client counterparts. But with the advent of AJAX and Web 2.0, we’ve actually got something that’s better than what we get in traditional GUIs: GUI responsiveness. To make an ordinary GUI responsive, we have to let user actions spawn work in separate threads, and those threads cannot directly update the GUI, since GUI APIs typically don’t allow updates from anything but the main thread. So lazy programmers always skip making some of those operations asynchronous, and the GUI won’t be very responsive as a result.
But web GUIs always respond to clicks (well, as far as the browser GUI itself is responsive; that’s not always the case, just watch Internet Explorer), so the web application GUI will be responsive whether its developer wants it or not. The developers are thus forced to address responsiveness on the server. And the problem of updating the GUI from a single thread no longer exists, since the browser and the server are separated by the network.
It is possible to build responsive rich-client GUIs, but too few people are actually doing it.
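The rich-client pattern described above can be sketched roughly like this. This is a minimal illustration in Python; the function names are my own, and a real GUI toolkit would supply the main-loop timer that calls the polling function:

```python
import threading
import queue

# Results queue: workers put updates here; only the main ("GUI") thread drains it.
ui_updates = queue.Queue()

def slow_operation(item):
    # Stand-in for a blocking call (network, disk, ...).
    return f"result for {item}"

def on_user_click(item):
    # Spawn the blocking work on a worker thread so the GUI stays responsive.
    def worker():
        # Workers never touch the GUI; they only post results for the main thread.
        ui_updates.put(slow_operation(item))
    t = threading.Thread(target=worker)
    t.start()
    return t

def main_loop_tick():
    # Called periodically on the main thread (e.g. from a GUI timer);
    # this is the only place the GUI would actually be updated.
    applied = []
    while not ui_updates.empty():
        applied.append(ui_updates.get())
    return applied
```

The point of the design is that the queue is the only shared state: workers may run at any time, but GUI updates all happen on the one thread that is allowed to make them.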

Update (same day): The responsiveness has always been there in web applications, of course, regardless of whether AJAX has been used or not, but not until now have web GUIs reached a standard where they can favourably be compared to traditional desktop GUIs.

Asynchronicity and Processor Load

Today I’d like to show a perhaps-not-so-obvious connection between system architecture and performance. I’ll start by setting the stage.

I’m working on a telecom system whose main component’s general architecture is like this: service logic + supporting subsystems. The service logic is coded as many (large) finite state machines, whose states in turn use the supporting subsystems for their services. The services of the supporting subsystems are all asynchronous, meaning that calls to their main functions don’t block; the results of the calls are returned later, on another thread, to the event queue of the requesting finite state machine. For example, a call over the network, or even opening a file on the hard disk, is blocking (it involves waiting), so it is queued for later execution on another thread. The result events from the supporting subsystems then drive the state machines forward.
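A toy version of this arrangement, sketched in Python for brevity (the class and method names are my own invention, not the actual system’s):

```python
import threading
import queue

class Subsystem:
    """A supporting subsystem: performs blocking work on its own thread
    and returns the result as an event to the caller's event queue."""
    def submit(self, payload, reply_to):
        def work():
            result = payload.upper()   # stand-in for a blocking operation
            reply_to.put(result)       # delivered later, on another thread
        threading.Thread(target=work).start()

class StateMachine:
    """Driven forward by events arriving on its queue; its own work never blocks."""
    def __init__(self):
        self.events = queue.Queue()
        self.state = "idle"

    def request(self, subsystem, payload):
        subsystem.submit(payload, reply_to=self.events)  # non-blocking call
        self.state = "waiting"

    def tick(self):
        # One "tick": consume a result event and advance the state machine.
        event = self.events.get()
        self.state = f"done:{event}"
```

The state machine itself never waits on the network or the disk; it only reacts to result events, which is what keeps each "tick" negligible in cost.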

The threads driving the finite state machines are running at a high priority ("time critical" using Windows terminology), whereas the supporting subsystems’ threads are running at lower priorities. Activity logging to disk is done by a thread running at the very lowest priority.

What happens when we run this system under high (constant) load? (Guess what, it means many phone calls arriving!) Then you see, empirically, that the processor load generally doesn’t vary much over time. Graphically, it’s almost a straight horizontal line. Apparently the service logic distributes all blocking operations pretty evenly over time, and there’s a fixed number of threads in the supporting subsystems, which can’t do more than a fixed maximum amount of work at the same time.

Now, we could say that this situation is good for scalability, in the following two ways:

  • Suppose that we’re running at a roughly constant 20 percent processor load. Then doubling the load would (naïvely, but see the next point!) make the resulting processor load 40 percent. Another system with a not-so-evenly distributed processor load, but with the same average, might not enjoy such scalability properties. For example, it could run at 60 percent processor load during one fifth of the time, and 10 percent during the remaining time. Then the peak load of the scaled-up system would go "through the roof".
  • If we scale up the system by adding more finite state machines, each operation, or "tick", of those state machines never blocks, so it takes a negligible amount of time. So until the sum of all those "tick" times becomes significant, we enjoy pretty good scalability.
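The arithmetic behind the first point, spelled out with the numbers from the example:

```python
# Evenly distributed load: a constant 20 percent.
even_avg = 20
even_peak_doubled = 2 * even_avg      # 40 percent: still comfortable

# Bursty load with the same 20 percent average:
# 60 percent for one fifth of the time, 10 percent for the rest.
bursty_avg = 0.2 * 60 + 0.8 * 10      # 20 percent, same average
bursty_peak_doubled = 2 * 60          # 120 percent: "through the roof"
```

Same average, very different headroom: it’s the peak, not the average, that limits how far you can scale the traffic.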

Scalability is never that simple, of course. But these are good properties for scalability anyway.

Now, an interesting question is whether this arrangement between finite state machines with only non-blocking operations on one hand, and asynchronous supporting subsystems on the other hand, is necessary for obtaining such an even processor load, and nice scalability properties. That is, we want to see what happens when the "ticks" of the state machines start to take some small significant amount of time, contrary to the second case above.

We’ll make a simple calculation to illustrate what happens. Suppose that we want to keep 4000 state machines running, and that each state contains a blocking operation that takes one millisecond to perform. If every state machine is ticked at least once per second, the ticking operations during that second would amount to four seconds of work. Oops! Didn’t work. That would mean 100 percent CPU load, and probably some triggered timeouts, too. We could imagine situations with shorter and fewer blocking operations, but you see the general problem from this example: many small blocking times add up to a long time when there are lots of state machines. Even if you had several processors, you would just postpone the problem a bit. You would also lose some "fairness" in the state machine execution; you couldn’t expect a regular frequency of "ticks" for each state machine.
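The calculation, written out as a small helper (the function name is mine, for illustration):

```python
def cpu_seconds_per_second(machines, ticks_per_second, blocking_ms_per_tick):
    # Total blocking time incurred during one wall-clock second.
    return machines * ticks_per_second * blocking_ms_per_tick / 1000.0

# 4000 state machines, each ticked once per second, 1 ms of blocking per tick:
load = cpu_seconds_per_second(4000, 1, 1.0)   # 4.0 seconds of work per second
```

Anything above 1.0 seconds of work per second means the single processor can’t keep up, no matter how the work is scheduled.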

I guess that if we knew the maximum blocking time for an operation, we might be able to build an architecture like this with some blocking operations in the state machines. But typically we don’t know that; the operations are blocking precisely because we need to wait for something, like a network response. And still, it wouldn’t scale as well as the non-blocking architecture. Above all, it’s not that difficult to make synchronous operations asynchronous by performing them in a separate thread, so there’s no actual reason for having blocking operations in a state machine anyway.
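Turning a synchronous operation into an asynchronous one can be as simple as this sketch (the helper name is hypothetical, not from any particular library):

```python
import threading
import queue

def make_async(blocking_fn, event_queue):
    """Return a non-blocking version of blocking_fn: a call starts a worker
    thread, and the result arrives later as an event on event_queue."""
    def call(*args):
        def worker():
            event_queue.put(blocking_fn(*args))
        t = threading.Thread(target=worker)
        t.start()
        return t   # returned so callers (or tests) can join if they want to
    return call

# Usage: a blocking function becomes fire-and-forget.
events = queue.Queue()
fetch = make_async(lambda x: x + 1, events)
```

The caller returns immediately; the state machine just picks the result up from its event queue on a later tick.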

So, there you’ve got some arguments for a nice "state machines + asynchronous services" architecture. I believe that this is about how close you can bring Windows to being a real-time operating system.

[Update 2005-10-17: corrected some spelling mistakes.]

Why Smart People Defend Bad Ideas

This is an old essay, but anyway, it’s good! It’s Scott Berkun, again, who writes about how and why smart people are so good at defending bad ideas. Applied to software in particular. This sentence from the essay states the problem very well:

Smart people, or at least those whose brains have good first gears, use their speed in thought to overpower others.

I like the expression "those whose brains have good first gears"! 😉

The Common Sense Behind the ATAM

I thought I’d better say a few things about the Architecture Tradeoff Analysis Method, too. It’s really built upon common sense (for an architect, that is), even if I can’t fully judge whether the building itself reaches way too far above the clouds; for a small or even medium-sized organization, though, the answer is definitely yes. However, that doesn’t matter. The pieces of common sense behind it are good, and somewhat nontrivial. My idea is that anyone doing architectural work can benefit from those pieces, regardless of whether you run the actual method or not.

First, I’ll readily admit that I haven’t even understood the complete ATAM. I’ve only read and understood an old overview paper, and it seems that the method has evolved a lot since 1998, when it was written. Perhaps I’ll come back with corrections when I’ve read all about it (if I ever will). I promise only to tell you things that make sense to me, anyway! If you’re annoyed by this, just pretend that the paper has just been published! 😉

So, what is it all about? Actually, the abstract of the overview paper says a lot. Here it is:

This paper presents the Architecture Tradeoff Analysis Method (ATAM), a structured technique for understanding the tradeoffs inherent in the architectures of software intensive systems. This method was developed to provide a principled way to evaluate a software architecture’s fitness with respect to multiple competing quality attributes: modifiability, security, performance, availability, and so forth. These attributes interact—improving one often comes at the price of worsening one or more of the others—as is shown in the paper, and the method helps us to reason about architectural decisions that affect quality attribute interactions. The ATAM is a spiral model of design: one of postulating candidate architectures followed by analysis and risk mitigation, leading to refined architectures.

OK, that makes sense. If we add another server to increase availability, we increase the cost, and perhaps decrease security, if we aren’t careful. Perhaps we have to co-locate lots of code in order to increase performance, thus making the architecture less modifiable. ATAM is a method for making these tradeoffs explicit, and to have a structured (iterative) method of getting to a software architecture that satisfies all the requirements on those properties.

It’s important to note that ATAM itself does not include ways of assessing modifiability, performance, security and all that; sub-methods such as the SAAM (or perhaps common sense) are used for obtaining those attributes. It’s really a "meta-method". But never mind, we’re not really interested in the formalities of the method itself now.

I suggest that we dive directly into the steps of the method; they aren’t that difficult to understand.

  1. Collect use cases that should be supported by the architecture, and requirements that the resulting system should satisfy.
  2. Construct a nice architecture based on what you got in the previous step.
  3. Analyze all the relevant properties (or attributes, as the terminology goes), such as modifiability, availability, performance and so on.
  4. If all the relevant properties were good enough, we’re done, and we can proceed to design and implementation! (But if you’re a bit curious, you could actually go on anyway, for a round.) Otherwise, we know that we need to modify the architecture in order to improve upon one or more attributes.
  5. Look at several (sensible) ways of modifying the architecture, and see how the properties of the architecture change. For example, adding a server might increase both availability and cost. The properties that actually changed significantly are noted as sensitivity points.
  6. Look at what you got in the previous step. Some of the changes you made to the architecture are likely to have affected more than one of the attributes, for example availability and cost when adding a server. Those changes (scenarios, perhaps?) are noted as tradeoff points. Those are the points where we have to be careful when changing our architecture. Perhaps the properties we have to improve upon are connected to lots of other attributes in this way?
  7. Now, we use the knowledge about the tradeoff points found in the previous step, and redesign the architecture so that we believe that we’ve come closer to satisfying the requirements on attributes. The tradeoff points simply serve as guides for us here. For example, if your company has no budget for new hardware, perhaps you have to have another way of establishing the availability requirements than adding another server.
  8. Go back to step 3.
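One round of the loop through the steps above can be sketched as code. Everything here (the function names, the shape of evaluate, the score dictionaries) is my own illustration, not ATAM terminology:

```python
def atam_round(architecture, evaluate, requirements, candidate_changes):
    # evaluate(arch) returns a dict of {attribute: score}, obtained via
    # sub-methods like the SAAM (or common sense).
    scores = evaluate(architecture)

    # Step 4: good enough on every required attribute? Then we're done.
    if all(scores[attr] >= minimum for attr, minimum in requirements.items()):
        return architecture, [], []

    # Steps 5 and 6: try candidate modifications; note which attributes move.
    sensitivity_points, tradeoff_points = [], []
    for change in candidate_changes:
        new_scores = evaluate(change(architecture))
        affected = [a for a in scores if new_scores[a] != scores[a]]
        if affected:
            sensitivity_points.append((change, affected))
        if len(affected) > 1:           # more than one attribute moved
            tradeoff_points.append((change, affected))

    # Step 7, redesigning guided by the tradeoff points, is left to the architect.
    return None, sensitivity_points, tradeoff_points
```

A change that moves only one attribute is merely a sensitivity point; a change that moves several at once is exactly the kind of tradeoff the method wants you to be careful with.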

OK, this really looks like common sense, all of it. Probably, if you’re an architect, this is what your brain is already doing, or at least something like it. But anyway, I think we can gain a lot from this kind of "formalized common sense". We can use it to communicate this kind of knowledge to others, and also to check ourselves to make sure that we’re reasoning in a sound way (perhaps at times when you’re working a lot of overtime and aren’t fully alert!). Sometimes our brains aren’t as accessible as we’d like them to be. 🙂