Wednesday, September 21, 2011


Introduction to Web Services

Web Services can convert your applications into Web-applications.
Web Services are published, found, and used through the Web.

What are Web Services?

  • Web services are application components
  • Web services communicate using open protocols
  • Web services are self-contained and self-describing
  • Web services can be discovered using UDDI
  • Web services can be used by other applications
  • XML is the basis for Web services

How Does it Work?

The basic Web services platform is XML + HTTP.
XML provides a language which can be used between different platforms and programming languages and still express complex messages and functions.
The HTTP protocol is the most used Internet protocol.
Web services platform elements:
  • SOAP (Simple Object Access Protocol)
  • UDDI (Universal Description, Discovery and Integration)
  • WSDL (Web Services Description Language)

Web Services have three basic platform elements: SOAP, WSDL and UDDI.

What is SOAP?

SOAP is an XML-based protocol to let applications exchange information over HTTP.
Or more simple: SOAP is a protocol for accessing a Web Service.
  • SOAP stands for Simple Object Access Protocol
  • SOAP is a communication protocol
  • SOAP is a format for sending messages
  • SOAP is designed to communicate via Internet
  • SOAP is platform independent
  • SOAP is language independent
  • SOAP is based on XML
  • SOAP is simple and extensible
  • SOAP allows you to get around firewalls
  • SOAP is a W3C standard
Read more about SOAP on our Home page.

What is WSDL?

WSDL is an XML-based language for locating and describing Web services.
  • WSDL stands for Web Services Description Language
  • WSDL is based on XML
  • WSDL is used to describe Web services
  • WSDL is used to locate Web services
  • WSDL is a W3C standard
Read more about WSDL on our Home page.

What is UDDI?

UDDI is a directory service where companies can register and search for Web services.
  • UDDI stands for Universal Description, Discovery and Integration
  • UDDI is a directory for storing information about web services
  • UDDI is a directory of web service interfaces described by WSDL
  • UDDI communicates via SOAP
  • UDDI is built into the Microsoft .NET platform
Read more about UDDI in our WSDL Tutorial.


Scalable System Design Patterns

Looking back after 2.5 years since my previous post on scalable system design techniques, I've observed an emergence of a set of commonly used design patterns. Here is my attempt to capture and share them.

Load Balancer

In this model, there is a dispatcher that determines which worker instance will handle the request based on different policies. The application should best be "stateless" so any worker instance can handle the request.

This pattern is deployed in almost every medium to large web site setup.



Scatter and Gather

In this model, the dispatcher multicast the request to all workers of the pool. Each worker will compute a local result and send it back to the dispatcher, who will consolidate them into a single response and then send back to the client.

This pattern is used in Search engines like Yahoo, Google to handle user's keyword search request ... etc.



Result Cache

In this model, the dispatcher will first lookup if the request has been made before and try to find the previous result to return, in order to save the actual execution.

This pattern is commonly used in large enterprise application. Memcached is a very commonly deployed cache server.



Shared Space

This model also known as "Blackboard"; all workers monitors information from the shared space and contributes partial knowledge back to the blackboard. The information is continuously enriched until a solution is reached.

This pattern is used in JavaSpace and also commercial product GigaSpace.



Pipe and Filter

This model is also known as "Data Flow Programming"; all workers connected by pipes where data is flow across.

This pattern is a very common EAI pattern.



Map Reduce

The model is targeting batch jobs where disk I/O is the major bottleneck. It use a distributed file system so that disk I/O can be done in parallel.

This pattern is used in many of Google's internal application, as well as implemented in open source Hadoop parallel processing framework. I also find this pattern can be used in many many application design scenarios.



Bulk Synchronous Parellel

This model is based on lock-step execution across all workers, coordinated by a master. Each worker repeat the following steps until the exit condition is reached, when there is no more active workers.

  1. Each worker read data from input queue
  2. Each worker perform local processing based on the read data
  3. Each worker push local result along its direct connection
This pattern has been used in Google's Pregel graph processing model as well as the Apache Hama project.



Execution Orchestrator

This model is based on an intelligent scheduler / orchestrator to schedule ready-to-run tasks (based on a dependency graph) across a clusters of dumb workers.

This pattern is used in Microsoft's Dryad project



Although I tried to cover the whole set of commonly used design pattern for building large scale system, I am sure I have missed some other important ones. Please drop me a comment and feedback.

Also, there is a whole set of scalability patterns around data tier that I haven't covered here. This include some very basic patterns underlying NOSQL. And it worths to take a deep look at some leading implementations.

Database Scalability


Database Scalability

Database is typically the last piece of the puzzle of the scalability problem. There are some common techniques to scale the DB tire

Indexing

Make sure appropriate indexes is built for fast access. Analyze the frequently-used queries and examine the query plan when it is executed (e.g. use "explain" for MySQL). Check whether appropriate index exist and being used.

Data De-normalization

Table join is an expensive operation and should be reduced as much as possible. One technique is to de-normalize the data such that certain information is repeated in different tables.

DB Replication


For typical web application where the read/write ratio is high, it will be useful to maintain multiple read-only replicas so that read access workload can be spread across. For example, in a 1 master/N slaves case, all update goes to master DB which send a change log to the replicas. However, there will be a time lag for replication.


Table Partitioning

You can partition vertically or horizontally.

Vertical partitioning is about putting different DB tables into different machines or moving some columns (rarely access attributes) to a different table. Of course, for query performance reason, tables that are joined together inside a query need to reside in the same DB.

Horizontally partitioning is about moving different rows within a table into a separated DB. For example, we can partition the rows according to user id. Locality of reference is very important, we should put the rows (from different tables) of the same user together in the same machine if these information will be access together.






Transaction Processing

Avoid mixing OLAP (query intensive) and OLTP (update intensive) operations within the same DB. In the OLTP system, avoid using long running database transaction and choose the isolation level appropriately. A typical technique is to use optimistic business transaction. Under this scheme, a long running business transaction is executed outside a database transaction. Data containing a version stamp is read outside the database trsnaction. When the user commits the business transaction, a database transaction is started at that time, the lastest version stamp of the corresponding records is re-read from the DB to make sure it is the same as the previous read (which means the data is not modified since the last read). Is so, the changes is pushed to the DB and transaction is commited (with the version stamp advanced). In case the version stamp is mismatched, the DB transaction as well as the business transaction is aborted.

Object / Relational Mapping
Although O/R mapping layer is useful to simplify persistent logic, it is usually not friendly to scalability. Consider the performance overhead carefully when deciding to use O/R mapping.

There are many tuning parameters in O/R mapping. Consider these ...
  • When an object is dereferenced, how deep the object will be retrieved
  • If a collection is dereferenced, does the O/R mapper retrieve all the object contained in the collection ?
  • When an object is expanded, choose carefully between multiple "single-join" queries and single "multiple join" query