Sunday, August 9, 2009

Cloud Is Stable: Let's Fix Some Design Flaws

We've had a full production cloud environment for a couple of months now and everything has been running stably. The networking portion of this offering has delivered everything we hoped for: rapid provisioning, centralized management, and lower cost per customer environment.

We've got roughly 15 enterprise customers on the cloud now, and I think the largest challenge we've faced is helping customers understand this new technology and how it changes things for them. For us, it's a really great thing to be able to take an order for a new environment and have it completely built within a few hours of speaking with the customer to understand exactly what they want. Provisioning a new cluster of firewalls now takes literally less than an hour from start to finish, including building the security policies.

The biggest challenge we've run into resides in the F5 shared load balancing arena. We knew this would present some challenges up front, so it wasn't a surprise, but we're now really trying to put some focus on it. The problem is that because of where the F5 cluster sits, and because we're running a one-armed load balancing configuration, we can't offer customers SSL offload. Even if we could, F5 has no way to limit resources on a per-customer basis. We were promised that capability in v10 of their software, but when it was released we were disappointed to find the features still weren't there.

Here is a summary of the resource allocation issue. The F5 load balancers are very powerful appliances; they can handle a huge volume of requests and are probably the best load balancers on the market. For the cloud offering, though, they have to be shared among many different customers. Because of how a customer's cloud environment can be carved up, and depending on many factors in the customer's application, IP traffic can look very different from one customer to another. When we designed this solution, we estimated how many customers we would be able to provision in this virtual environment. But if a few customers have huge traffic loads, or their apps require special handling for their packets, they can chew up far more resources on the F5 cluster than the average customer would. That risk could significantly reduce how many customers we are actually able to provision on parts of the cloud infrastructure, which of course would reduce overall profits for the offering.
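To make the risk concrete, here is a minimal back-of-the-envelope capacity model. All of the numbers are illustrative assumptions I'm making up for the sketch, not actual F5 ratings or our real planning figures; the point is just how quickly a few heavy customers erode the customer count the cluster was sized for.

```python
# Hypothetical capacity model -- every number below is an assumption
# for illustration, not a real F5 specification.
CLUSTER_CPS = 100_000        # total connections/sec the shared cluster handles
AVG_CUSTOMER_CPS = 2_000     # per-customer load assumed in the original sizing
HEAVY_CUSTOMER_CPS = 15_000  # what one traffic-heavy customer actually consumes

def customers_supported(heavy_customers: int) -> int:
    """Total customers that fit once heavy customers take their share."""
    remaining = CLUSTER_CPS - heavy_customers * HEAVY_CUSTOMER_CPS
    return max(remaining // AVG_CUSTOMER_CPS, 0) + heavy_customers

print(customers_supported(0))  # sizing plan: 50 average customers per cluster
print(customers_supported(3))  # three heavy customers shrink that to 30 total
```

With no per-customer limits, there is nothing stopping the hypothetical three heavy customers above from cutting the cluster's sellable capacity by 40%, which is exactly the predictability problem described here.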

Cisco is not known for their load balancing products, but they do have an offering that holds a lot of promise. The ACE load balancer has a virtualization feature that lets us provision customer load balancers as virtual instances off one physical cluster. Resources can be dedicated to a particular customer with guarantees behind them. For example, we could say that customer X gets 500 transactions per second of SSL handling, can use 10% of the load balancer's CPU, and can have up to 50,000 concurrent sessions running at any one time. This helps us with predictability and cost modeling, because we now know that each customer we provision in that environment will get the resources they were promised when they signed the contract.
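As a rough sketch of what that looks like on the ACE, resource guarantees are defined in a resource class and attached to a per-customer virtual context. The names and numbers below are hypothetical, and note that (as I understand the ACE CLI) `limit-resource` values are expressed as percentages of the appliance's total capacity, so an absolute figure like 500 SSL TPS gets translated into a share of the box's rating:

```
! Hypothetical sketch, not our production config.
! Percentages are illustrative; VLAN and names are made up.
resource-class customer-x-gold
  limit-resource all minimum 10.00 maximum equal-to-min
  limit-resource rate ssl-connections minimum 5.00 maximum equal-to-min
  limit-resource conc-connections minimum 10.00 maximum equal-to-min

context customer-x
  allocate-interface vlan 100
  member customer-x-gold
```

The `maximum equal-to-min` setting is what makes the model predictable: the context can never burst past its guarantee and starve a neighboring customer.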

We used our lab cloud environment this week to build two different customer cloud environments so that we could begin testing with the ACE. We think it will do what we want, and now we just have to prove it out. I'll post more next week after we've completed the testing.
