Problem
MyJob.com needs to send real time jobs matching with resumes. Where volume of data is 25 MN resumes and 1 Lac jobs.
These matching jobs should be sent as notifications to any devices, Web and other social media.
Low Latency is the key requirement.
Solution
Our goal is to match resumes for jobs. A tagging scheme for jobs and resumes is the approach I would take. Any new job and resumes will have tags associated with them.
High level Search plan. Steps -
1. Get job
2. Get all tags for this job
3. Find associated resumes for these tags
4. Send emails to the job seekers
In order to create a low latency and highly scalable system, we need to create a loosely coupled platform in which Job creation, Resume creation and Job notification - all three systems should be developed as independent components. We will follow Service Oriented Architecture here.
Job updation system diagram
Tag creation system diagram
Job notification system diagram
Database schema design:
Assuming size of 1 resume = 100KB
We will store resumes in DB after compressing them. Any lossless compression algorithm such as LZW would compress text data by a factor of 8-10.
So, compressed data for 1 resume = 10KB approximately
Size needed to store 25MN = 25 * 10^6 * 10KB = 250GB
This would only be the size of resume text data.
Other than this, resume_details table contains 25MN rows. Size = 25GB
To support this load, we can use RDBMS MySQL 5.5.x.
Resume (Creating resume table with only email and resume text data. This will help us avoid expensive joins on this table) [25MN rows. Row size - 10KB. Max size - 250GB]
| id | email_id | resume_text | is_active | updated_on |
| 1 | sanjay.kv@gmail.com | compressed_cv1 | 1 | tr1 |
| 2 | nilanjan.roy@gmail.com |
compressed_cv2
| 0 | tr2 |
Resume_Tag (Contains mapping information. This will be used to find resume_text based on tag ids. Joins will be formed on this table.) [500MN rows. Row size - 100B. Max size - 50GB]
| id | email_id | resume_id [Links id from Resume table] | tag_id | updated_on |
| 1 |
sanjay.kv@gmail.com
| 1 | 1 | trt1 |
| 2 |
nilanjan.roy@gmail.com
| 2 |
2
| trt2 |
Job (Jobs table. Keeps summary and description information) [100,000 rows. Row size - 5KB. Table size - 500MB]
| id | summary | job_desc | is_active | updated_on |
| 1 | summary_txt1 | compressed_desc1 | 0 | tj1 |
| 2 |
summary_txt2
|
compressed_desc2
| 1 | tj2 |
Job_Tag (Contains mapping information. This will be used to find job_id based on tag ids. Joins will be formed on this table.) [1MN rows. Row size - 200B. Max size - 200MB]
| id | tag_id | job_id [Links id from Job table] | updated_on |
| 1 | 3 | 1 | tj1 |
| 2 | 2 | 1 | tj2 |
Tag (Contains exhaustive list of tags to match jobs and resumes) [Small table. Not more than 50MB]
| id | name | updated_on |
| 1 | IIT | tt1 |
| 2 | Engineering | tt2 |
| 3 | Manager | tt3 |
| 4 | Architect | tt4 |
| 5 | ecommerce | tt5 |
| 6 | NCR | tt6 |
| 7 | Yatra | tt7 |
| 8 | Java | tt8 |
| .. | .. | .. |
| .. | .. | .. |
| n | tag_n | ttn |
After this schema design:
We need to partition resume database horizontal partitioning / sharding based on email_id field in resume table).
As the total db size is approx 300GB.
resume - 250GB
resume_tag - 50GB
others < 1GB
We can make 18 partitions of resume table having 14-15GB each and based on hashing scheme:
resume table = hash(email)
Above hash function can be implemented in different ways. The way I approach it would be to distribute data on the basis on first alphabet frequency.
Following chart gives the frequency distribution of all alphabets:
So,
Table 1 - email addresses starting with 'e' stored in table - resume_e
Table 2 - email addresses starting with 't' stored in table - resume_t
Table 3 - email addresses starting with 'a' stored in table - resume_a
Table 4 - email addresses starting with 'o' stored in table - resume_o
Table 5 - email addresses starting with 'i' stored in table - resume_i
Table 6 - email addresses starting with 'n' stored in table - resume_n
Table 7 - email addresses starting with 's' stored in table - resume_s
Table 8 - email addresses starting with 'h' stored in table - resume_h
Table 9 - email addresses starting with 'r' stored in table - resume_r
Table 10 - email addresses starting with 'd' stored in table - resume_d
Table 11 - email addresses starting with 'l' stored in table - resume_l
Table 12 - email addresses starting with 'c', 'u' stored in table - resume_cu
Table 13 - email addresses starting with 'm','w' stored in table - resume_mw
Table 14 - email addresses starting with 'f','g' stored in table - resume_fg
Table 15 - email addresses starting with 'y','p' stored in table - resume_yp
Table 16 - email addresses starting with 'b','v' stored in table - resume_fg
Table 17 - email addresses starting with 'k','j','x','q','z' stored in table - resume_kjxqz
Table 18 - email addresses starting with numbers stored in table - resume_09
Why email_id and not resume_id?
Because we may have a use case, in which I need to fetch resume data based on email addresses.
If I partition based on resume_id, then I will have to scan all the partitioned tables to locate the resume.
With the above partitioning scheme, I can locate the table for email_id directly from application.