Saturday, April 2, 2016

SaaS Developer Guide - Part 2 - Scalability and Resiliency

This blog post is divided into two sections:
  • Background - Touches on definitions and calculations
  • Current Technology and Execution - Available technologies, their features, and the Cloud Learner deployment
Background 

Distributed systems have been around in the software world for quite a while, since the 1970s. The field has kept evolving, picking up new terms, and becoming a key ingredient of cloud-based systems. Formally, a distributed system is a system whose components communicate over a network. Today, highly scalable SaaS systems are designed as a horizontally scaling "set of services". A service is an evolution of a distributed component: if a component in a distributed system is given a well-defined interface, it becomes a service.

Why Services?
  • Services can be developed, deployed and upgraded independently
  • Services can be scaled and made highly available independently: more replicas for important services, more nodes for services with higher load
  • Different services can use different technologies
  • Strong boundaries for modularity

In the Cloud Learner application, there are going to be several services:
  • Class CRUD service
  • UserMgt service
  • Subscription service
  • Course CRUD service
  • Front end component


Load balanced Nodes

When the load on the "Class CRUD service" gets high, we can run two instances of it to serve the requests. The load balancer distributes the requests between these nodes.
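As a rough illustration of what the load balancer does here, below is a minimal round-robin sketch. Round-robin is only one possible distribution policy, and the class name and the node names ("class-crud-1", "class-crud-2") are made up for this example; a real cloud load balancer would also take node health and current load into account.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out nodes in a fixed rotation, one node per request."""

    def __init__(self, nodes):
        nodes = list(nodes)
        if not nodes:
            raise ValueError("at least one node is required")
        self._rotation = cycle(nodes)

    def pick(self):
        """Return the node that should serve the next request."""
        return next(self._rotation)


# Hypothetical node names for the two "Class CRUD service" instances.
balancer = RoundRobinBalancer(["class-crud-1", "class-crud-2"])
assignments = [balancer.pick() for _ in range(4)]
```

With two nodes, consecutive requests simply alternate between them, so each instance sees about half of the traffic.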



"Elasticity", or "auto-scaling", is the primary scalability ingredient of a SaaS. It means the system scales up as the load rises and scales down as the load falls. This allows a SaaS to exploit the full potential of the underlying IaaS/PaaS, which can provision infrastructure on demand. Elasticity leads to optimized resource consumption, which results in economy of scale. We'll discuss "economy of scale" at length in a future blog post.
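To make the idea concrete, here is a small sketch of one possible scaling rule: keep the average utilization per node close to a target by resizing the cluster proportionally. The function name, the 60% target and the node limits are assumptions chosen for this example, not values taken from any particular cloud provider.

```python
def desired_nodes(current_nodes, avg_utilization_pct,
                  target_utilization_pct=60, min_nodes=1, max_nodes=10):
    """Return how many nodes the cluster should have.

    Utilization values are whole percentages (0-100). The cluster is
    resized so that the same total load would sit at roughly the target
    utilization, and the result is clamped to [min_nodes, max_nodes].
    """
    if current_nodes < 1:
        raise ValueError("current_nodes must be at least 1")
    if target_utilization_pct <= 0:
        raise ValueError("target_utilization_pct must be positive")
    total_load = current_nodes * avg_utilization_pct
    # Integer ceiling division: round up so we never under-provision.
    wanted = -(-total_load // target_utilization_pct)
    return max(min_nodes, min(max_nodes, wanted))
```

For instance, 2 nodes running at 90% ask for a third node, while 4 nodes idling at 15% shrink back to one. A production rule would normally add a cool-down period so the cluster does not flap between sizes.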

Resiliency and scalability go hand in hand in a large distributed system. For example, one of the primary methods of providing resiliency is detecting a faulty node, removing it from the cluster, and creating another instance to replace it. If one instance of component "A" goes down, the capacity of that module drops to (n-1)/n of its original level, where n is the number of nodes in the module, but the system continues to function. This is a key feature of a resilient architecture.

              Percentage performance hit on a cluster = f/n * 100%
       where,
              n is the total number of nodes
              f is the number of faulty nodes

When n is large, losing a single node causes only a small performance hit. But what happens when n is small? For example, if there are only 2 nodes, 1 malfunctioning node is critical: half of the capacity is gone.
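The effect of cluster size is easy to check with a few numbers. The sketch below assumes identical nodes with the load spread evenly, and takes the performance hit to be the share of nodes lost (faulty nodes divided by total nodes); the function name is made up for this example.

```python
def performance_hit_percent(total_nodes, faulty_nodes):
    """Percentage of cluster capacity lost when some nodes are faulty.

    Assumes identical nodes and an evenly spread load, so the hit is
    simply faulty_nodes / total_nodes expressed as a percentage.
    """
    if total_nodes < 1:
        raise ValueError("total_nodes must be at least 1")
    if not 0 <= faulty_nodes <= total_nodes:
        raise ValueError("faulty_nodes must be between 0 and total_nodes")
    return 100 * faulty_nodes / total_nodes


# One failed node in clusters of different sizes.
hits = {n: performance_hit_percent(n, 1) for n in (2, 10, 100)}
```

Losing one node out of 2 costs 50% of the capacity, one out of 10 costs 10%, and one out of 100 costs only 1%.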




Everybody starts small. So for an initial SaaS provider with a cluster of 2 nodes that are fully utilized by the load, adding one more node in active mode gives an n+1 resilience factor: n nodes carry the load and one extra node absorbs a single failure. In the Cloud Learner application, one node is enough to manage subscriptions, but another node is added for resilience.
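The n+1 sizing above can be written down as a tiny calculation. This is a sketch under simple assumptions: load and per-node capacity are whole numbers in the same unit (say, requests per second), and both the function name and the sample numbers are invented for illustration.

```python
def nodes_with_spare(peak_load, capacity_per_node, spares=1):
    """Nodes needed to carry peak_load, plus spare nodes for resilience."""
    if capacity_per_node < 1:
        raise ValueError("capacity_per_node must be at least 1")
    if peak_load < 0 or spares < 0:
        raise ValueError("peak_load and spares must not be negative")
    # Integer ceiling division: a partly used node still counts as a node.
    needed = -(-peak_load // capacity_per_node)
    # Even an idle service keeps one node running.
    return max(needed, 1) + spares


# Load that fills exactly 2 nodes -> 3 nodes with n+1.
busy_cluster = nodes_with_spare(peak_load=200, capacity_per_node=100)
# Load that one node can handle (like subscriptions) -> 2 nodes with n+1.
small_service = nodes_with_spare(peak_load=40, capacity_per_node=100)
```

The spare node runs in active mode, so in normal operation every node is lightly loaded; when one fails, the remaining n nodes can still carry the full load.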




Current Technology and Execution

Once the load has been predicted and tested and the calculations are done, it is important to glance at the current cloud landscape to understand the "execution" aspects of achieving scalability and resilience. The line between PaaS and IaaS is blurring as IaaS providers keep adding PaaS-level features. The following public cloud features can be leveraged for scalability and resiliency:
  • Load balancing
  • Health check
  • Rule based auto scaling
  • Rule based routing
  • Availability zones 
For example, look at the functionality GCloud provides for auto-scaling.

Now that is the difference made by "going cloud all the way". The sample application can be deployed with scalability and resilience on a PaaS as follows. There are two FE components. All services have an n+1 resilience factor. Here the ClassCRUD service is deployed separately because its load factor is higher than that of the other services. The ClassCRUD service is the most important service in the Cloud Learner application, as it serves 90% of the user actions. Therefore it has 3 nodes.



