transcripts: September 2014

Problem

MyJob.com needs to send real time jobs matching with resumes. Where volume of data is 25 MN resumes and 1 Lac jobs.
These matching jobs should be sent as notifications to any devices, Web and other social media.
Low Latency is the key requirement.

Solution

Our goal is to match resumes for jobs. A tagging scheme for jobs and resumes is the approach I would take. Any new job and resumes will have tags associated with them.

High level Search plan. Steps -

1. Get job

2. Get all tags for this job

3. Find associated resumes for these tags

4. Send emails to the job seekers

In order to create a low latency and highly scalable system, we need to create a loosely coupled platform in which Job creation, Resume creation and Job notification - all three systems should be developed as independent components. We will follow Service Oriented Architecture here.

Resume creation / updation system diagram (created using lucidchart - https://www.lucidchart.com)

Job updation system diagram

Tag creation system diagram

Job notification system diagram

Database schema design:

Assuming size of 1 resume = 100KB

We will store resumes in DB after compressing them. Any lossless compression algorithm such as LZW would compress text data by a factor of 8-10.

So, compressed data for 1 resume = 10KB approximately

Size needed to store 25MN = 25 * 10^6 * 10KB = 250GB

This would only be the size of resume text data.

Other than this, resume_details table contains 25MN rows. Size = 25GB

To support this load, we can use RDBMS MySQL 5.5.x.

Resume (Creating resume table with only email and resume text data. This will help us avoid expensive joins on this table) [25MN rows. Row size - 10KB. Max size - 250GB]

id	email_id	resume_text	is_active	updated_on
1	sanjay.kv@gmail.com	compressed_cv1	1	tr1
2	nilanjan.roy@gmail.com	compressed_cv2	0	tr2

Resume_Tag (Contains mapping information. This will be used to find resume_text based on tag ids. Joins will be formed on this table.) [500MN rows. Row size - 100B. Max size - 50GB]

id	email_id	resume_id [Links id from Resume table]	tag_id	updated_on
1	sanjay.kv@gmail.com	1	1	trt1
2	nilanjan.roy@gmail.com	2	2	trt2

Job (Jobs table. Keeps summary and description information) [100,000 rows. Row size - 5KB. Table size - 500MB]

id	summary	job_desc	is_active	updated_on
1	summary_txt1	compressed_desc1	0	tj1
2	summary_txt2	compressed_desc2	1	tj2

Job_Tag (Contains mapping information. This will be used to find job_id based on tag ids. Joins will be formed on this table.) [1MN rows. Row size - 200B. Max size - 200MB]

id	tag_id	job_id [Links id from Job table]	updated_on
1	3	1	tj1
2	2	1	tj2

Tag (Contains exhaustive list of tags to match jobs and resumes) [Small table. Not more than 50MB]

id	name	updated_on
1	IIT	tt1
2	Engineering	tt2
3	Manager	tt3
4	Architect	tt4
5	ecommerce	tt5
6	NCR	tt6
7	Yatra	tt7
8	Java	tt8
..	..	..
..	..	..
n	tag_n	ttn

After this schema design:

We need to partition resume database horizontal partitioning / sharding based on email_id field in resume table).

As the total db size is approx 300GB.

resume - 250GB

resume_tag - 50GB

others < 1GB

We can make 18 partitions of resume table having 14-15GB each and based on hashing scheme:

resume table = hash(email)

Above hash function can be implemented in different ways. The way I approach it would be to distribute data on the basis on first alphabet frequency.

Following chart gives the frequency distribution of all alphabets:

So,

Table 1 - email addresses starting with 'e' stored in table - resume_e

Table 2 - email addresses starting with 't' stored in table - resume_t

Table 3 - email addresses starting with 'a' stored in table - resume_a

Table 4 - email addresses starting with 'o' stored in table - resume_o

Table 5 - email addresses starting with 'i' stored in table - resume_i

Table 6 - email addresses starting with 'n' stored in table - resume_n

Table 7 - email addresses starting with 's' stored in table - resume_s

Table 8 - email addresses starting with 'h' stored in table - resume_h

Table 9 - email addresses starting with 'r' stored in table - resume_r

Table 10 - email addresses starting with 'd' stored in table - resume_d

Table 11 - email addresses starting with 'l' stored in table - resume_l

Table 12 - email addresses starting with 'c', 'u' stored in table - resume_cu

Table 13 - email addresses starting with 'm','w' stored in table - resume_mw

Table 14 - email addresses starting with 'f','g' stored in table - resume_fg

Table 15 - email addresses starting with 'y','p' stored in table - resume_yp

Table 16 - email addresses starting with 'b','v' stored in table - resume_fg

Table 17 - email addresses starting with 'k','j','x','q','z' stored in table - resume_kjxqz

Table 18 - email addresses starting with numbers stored in table - resume_09

Why email_id and not resume_id?

Because we may have a use case, in which I need to fetch resume data based on email addresses.

If I partition based on resume_id, then I will have to scan all the partitioned tables to locate the resume.

With the above partitioning scheme, I can locate the table for email_id directly from application.

transcripts

Saturday, September 6, 2014

How does HTTPS really work?

Architecture of a job portal

Blog Archive

About Me