Saturday, September 6, 2014

How does HTTPS really work?


HTTPS is combination of HTTP and SSL(Secure Socket Layer) to provide encrypted communication between client (browser) and web server (application is hosted here).
Why is it needed?
HTTPS encrypts data that is transmitted from browser to server over the network. So, no one can sniff the data during transmission.
How HTTPS connection is established between browser and web server?
  1. Browser tries to connect to the https://payment.com.
  2. payment.com server sends a certificate to the browser. This certificate includes payment.com server's public key, and some evidence that this public key actually belongs to payment.com.
  3. Browser verifies the certificate to confirm that it has the proper public key for payment.com.
  4. Browser chooses a random new symmetric key K to use for its connection to payment.com server. It encrypts K under payment.com public key.
  5. payment.com decrypts K using its private key. Now both browser and the payment server know K, but no one else does.
  6. Anytime browser wants to send something to payment.com, it encrypts it under K; the payment.com server decrypts it upon receipt. Anytime the payment.com server wants to send something to your browser, it encrypts it under K.
This flow can be represented by the following diagram: enter image description here

Architecture of a job portal

Problem
MyJob.com needs to send real time jobs matching with resumes. Where volume of data is 25 MN resumes and 1 Lac jobs.
These matching jobs should be sent as notifications to any devices, Web and other social media.
Low Latency is the key requirement.

Solution
Our goal is to match resumes for jobs. A tagging scheme for jobs and resumes is the approach I would take. Any new job and resumes will have tags associated with them.
High level Search plan. Steps - 
1. Get job
2. Get all tags for this job
3. Find associated resumes for these tags
4. Send emails to the job seekers

In order to create a low latency and highly scalable system, we need to create a loosely coupled platform in which Job creationResume creation and Job notification - all three systems should be developed as independent components. We will follow Service Oriented Architecture here.

Resume creation / updation system diagram (created using lucidchart - https://www.lucidchart.com)

Job updation system diagram


Tag creation system diagram

Job notification system diagram

Database schema design:
Assuming size of 1 resume = 100KB
We will store resumes in DB after compressing them. Any lossless compression algorithm such as LZW would compress text data by a factor of 8-10.
So, compressed data for 1 resume = 10KB approximately

Size needed to store 25MN = 25 * 10^6 * 10KB = 250GB
This would only be the size of resume text data.
Other than this, resume_details table contains 25MN rows. Size = 25GB

To support this load, we can use RDBMS MySQL 5.5.x.

Resume (Creating resume table with only email and resume text data. This will help us avoid expensive joins on this table) [25MN rows. Row size - 10KB. Max size - 250GB]
idemail_idresume_textis_activeupdated_on
1sanjay.kv@gmail.comcompressed_cv11tr1
2nilanjan.roy@gmail.com
compressed_cv2
0tr2


Resume_Tag (Contains mapping information. This will be used to find resume_text based on tag ids. Joins will be formed on this table.) [500MN rows. Row size - 100B. Max size - 50GB]
idemail_idresume_id [Links id from Resume table]tag_idupdated_on
1
sanjay.kv@gmail.com
11trt1
2
nilanjan.roy@gmail.com
2
2
trt2



Job (Jobs table. Keeps summary and description information) [100,000 rows. Row size - 5KB. Table size - 500MB]
idsummaryjob_descis_activeupdated_on
1summary_txt1compressed_desc10tj1
2
summary_txt2
compressed_desc2
1tj2


Job_Tag (Contains mapping information. This will be used to find job_id based on tag ids. Joins will be formed on this table.) [1MN rows. Row size - 200B. Max size - 200MB]
idtag_idjob_id [Links id from Job table]updated_on
131tj1
221tj2

Tag (Contains exhaustive list of tags to match jobs and resumes) [Small table. Not more than 50MB]
idnameupdated_on
1IITtt1
2Engineeringtt2
3Managertt3
4Architecttt4
5ecommercett5
6NCRtt6
7Yatratt7
8Javatt8
......
......
ntag_nttn


After this schema design: 
We need to partition resume database horizontal partitioning / sharding based on email_id field in resume table).

As the total db size is approx 300GB. 
resume - 250GB
resume_tag - 50GB
others < 1GB

We can make 18 partitions of resume table having 14-15GB each and based on hashing scheme:
resume table = hash(email)

Above hash function can be implemented in different ways. The way I approach it would be to distribute data on the basis on first alphabet frequency.
Following chart gives the frequency distribution of all alphabets:

So, 
Table 1 - email addresses starting with 'e' stored in table - resume_e
Table 2 - email addresses starting with 't' stored in table - resume_t
Table 3 - email addresses starting with 'a' stored in table - resume_a
Table 4 - email addresses starting with 'o' stored in table - resume_o
Table 5 - email addresses starting with 'i' stored in table - resume_i
Table 6 - email addresses starting with 'n' stored in table - resume_n
Table 7 - email addresses starting with 's' stored in table - resume_s
Table 8 - email addresses starting with 'h' stored in table - resume_h
Table 9 - email addresses starting with 'r' stored in table - resume_r
Table 10 - email addresses starting with 'd' stored in table - resume_d
Table 11 - email addresses starting with 'l' stored in table - resume_l
Table 12 - email addresses starting with 'c', 'u' stored in table - resume_cu
Table 13 - email addresses starting with 'm','w' stored in table - resume_mw
Table 14 - email addresses starting with 'f','g' stored in table - resume_fg
Table 15 - email addresses starting with 'y','p' stored in table - resume_yp
Table 16 - email addresses starting with 'b','v' stored in table - resume_fg
Table 17 - email addresses starting with 'k','j','x','q','z' stored in table - resume_kjxqz
Table 18 - email addresses starting with numbers stored in table - resume_09

Why email_id and not resume_id?
Because we may have a use case, in which I need to fetch resume data based on email addresses.
If I partition based on resume_id, then I will have to scan all the partitioned tables to locate the resume.
With the above partitioning scheme, I can locate the table for email_id directly from application.