Talent.com, founded in 2011, runs a unified job search platform that covers more than 75 countries and over 30 million job listings across many languages and industries. Working with AWS, it built a job recommendation engine based on deep learning. A large-scale data processing pipeline reads raw job data in JSON Lines format from Amazon S3, then extracts and refines the features used by the recommendation engine. This pipeline significantly shortened the time needed to bring the ML system to production.
Solution Overview
Talent.com, in collaboration with AWS, has built a job recommendation engine using Amazon SageMaker. The engine handles over 30 million job listings from various sources and uses deep learning to give users personalized job recommendations. To process this volume of data, the team developed a three-phase ETL (extract, transform, and load) pipeline that combines Amazon SageMaker Processing, AWS Glue, Amazon Athena, and Python libraries for feature extraction and data management.
Phase 1: Process Raw JSONL Files
The pipeline uses Amazon SageMaker Processing jobs to handle the raw JSONL files for specific days, performing feature extraction and data compaction. The JSONL files are processed in parallel, and the extracted features are saved as Parquet files and uploaded to Amazon S3. Storing the features this way enables efficient crawling and SQL queries in subsequent pipeline stages.
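The per-file processing described above can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the record fields (id, title, country), the function names, and the use of a thread pool are all assumptions, and in a real SageMaker Processing job the input and output directories would be the local paths that the job maps to Amazon S3.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_features(record: dict) -> dict:
    """Keep a few fields from one raw job record (field names are hypothetical)."""
    return {
        "job_id": record.get("id"),
        "title": (record.get("title") or "").strip().lower(),
        "country": record.get("country"),
    }


def process_file(path: Path) -> list[dict]:
    """Read one JSON Lines file and extract features from every non-empty line."""
    rows = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                rows.append(extract_features(json.loads(line)))
    return rows


def process_day(input_dir: Path, max_workers: int = 4) -> list[dict]:
    """Process all *.jsonl files for one day in parallel and merge the results."""
    files = sorted(input_dir.glob("*.jsonl"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_file = list(pool.map(process_file, files))  # keeps file order
    return [row for rows in per_file for row in rows]


def write_parquet(rows: list[dict], output_path: Path) -> None:
    """Compact the rows into one Parquet file (requires pandas and pyarrow)."""
    import pandas as pd  # imported lazily so the rest runs without pandas

    pd.DataFrame(rows).to_parquet(output_path, index=False)
```

A thread pool keeps the sketch simple. For CPU-heavy feature extraction, a process pool or several processing instances would likely be a better fit.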
Phase 2: Crawl Processed Data Using AWS Glue
Once the raw data for multiple days has been processed, an AWS Glue crawler creates an Athena table from the processed data. The table makes it straightforward to manage large volumes of features for subsequent model training.
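As a rough sketch of this step, a crawler can be created and started with boto3's Glue client. The helper names and all values passed in (crawler name, IAM role ARN, database name, S3 path) are placeholders chosen for illustration, not details from the original solution.

```python
def build_crawler_config(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build the keyword arguments for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # The crawler scans the Parquet files under this S3 prefix.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(config: dict, glue_client=None) -> None:
    """Create the crawler, then start it so the table gets registered."""
    if glue_client is None:
        import boto3  # imported lazily; requires AWS credentials at call time

        glue_client = boto3.client("glue")
    glue_client.create_crawler(**config)
    glue_client.start_crawler(Name=config["Name"])
```

Once the crawler run finishes, the resulting table appears in the AWS Glue Data Catalog and can be queried from Athena.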
Phase 3: Load Processed Features for Training
Processed features for a specified date range are loaded from the Athena table with a SQL query and passed to the training of the job recommender model. The solution keeps these steps simple and gives both data scientists and ML engineers a quick path to production.
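Loading a date range might look like the sketch below. It assumes the table is partitioned by a string column named dt holding ISO dates, which is a hypothetical choice, and it uses the AWS SDK for pandas (awswrangler) to run the query. The source does not say which client library was used, so treat that as an assumption too.

```python
from datetime import date


def build_feature_query(table: str, start: date, end: date) -> str:
    """Build a SQL query selecting features between two dates (inclusive)."""
    if start > end:
        raise ValueError("start date must not be after end date")
    if not table.replace("_", "").isalnum():
        raise ValueError("table name must be alphanumeric or underscores")
    # 'dt' is a hypothetical partition column storing dates as YYYY-MM-DD.
    return (
        f'SELECT * FROM "{table}" '
        f"WHERE dt BETWEEN '{start.isoformat()}' AND '{end.isoformat()}'"
    )


def load_features(table: str, database: str, start: date, end: date):
    """Run the query on Athena and return the result as a pandas DataFrame."""
    import awswrangler as wr  # imported lazily; requires AWS credentials

    return wr.athena.read_sql_query(
        sql=build_feature_query(table, start, end),
        database=database,
    )
```

Filtering on the partition column lets Athena skip data outside the requested range, which is what makes incremental, date-based loading inexpensive.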
Solution Benefits
The solution offers several advantages: simplified implementation, a quick path to production, reusability, efficiency, and support for incremental updates. It enables Talent.com to process large volumes of data, using the ETL pipeline to create training data and deploy the recommendation system into production within a short timeframe. The resulting system achieved an 8.6% increase in click-through rate in A/B testing, showing its tangible impact on connecting users with relevant job opportunities.
Conclusion
The ETL pipeline outlined in this post has played a crucial role in enabling Talent.com to build and deploy its job recommendation system efficiently. Built on Amazon SageMaker Processing jobs, the pipeline streamlines feature extraction and provides the infrastructure needed to develop and deploy ML models at scale. The authors encourage readers to explore how the pipeline could apply to other use cases, emphasising its reusability and efficiency in AI and ML workflows.
About the Authors
The team behind this solution includes experts from both the Amazon Machine Learning Solutions Lab and Talent.com, with experience in AI, machine learning, and technology solutions. Their collaboration produced a practical AI solution that helps Talent.com connect users with relevant jobs and improves user engagement.