Creation of an IT Data Processing System for the Genealogical Sector
Design, development, and support of an internal automated system for creating configurable and production-ready data processing pipelines with integrated machine learning models. The solution ensures scalability, fault tolerance, and management of the complete data lifecycle — from ingestion and preprocessing to inference and monitoring.
Customer
The client is a company specializing in genealogical research, processing, and providing access to historical archival data.
Task
The SimbirSoft team faced the following challenges:
- Minimize errors in the processes of integrating ML models,
- Reduce manual labor,
- Shorten the time to market for ML models,
- Scale the infrastructure,
- Organize monitoring and alerting mechanisms.
Solution
1. Minimization of Errors in ML Model Integration Processes:
Templates were developed for generating data processing pipelines. Docker containers were used to isolate models with different dependencies, reducing the risks of conflicts and deployment errors. The pipelines were designed to handle images and document scans and support high-load operations: preprocessing, OCR, and GPU computations.
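The templated-pipeline idea can be illustrated with a minimal sketch. The `Stage`/`Pipeline` names and the preprocess/OCR steps below are illustrative assumptions, not the project's actual API; they only show how a pipeline can be assembled from configurable, interchangeable stages.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Stage:
    """One named step in a data processing pipeline."""
    name: str
    run: Callable[[Any], Any]


class Pipeline:
    """Runs items through an ordered list of stages."""

    def __init__(self, stages: List[Stage]):
        self.stages = stages

    def process(self, item: Any) -> Any:
        for stage in self.stages:
            item = stage.run(item)
        return item


# Illustrative stages for an image/document-scan workload.
def preprocess(image: str) -> dict:
    return {"image": image, "normalized": True}


def ocr(data: dict) -> dict:
    # Placeholder standing in for a real OCR call.
    data["text"] = f"recognized:{data['image']}"
    return data


pipeline = Pipeline([Stage("preprocess", preprocess), Stage("ocr", ocr)])
result = pipeline.process("scan_001.png")
```

A template like this lets each model ship with the same pipeline skeleton while only the stage list changes, which is what keeps integration errors down.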
2. Reduction of Manual Labor through Automation of ETL/ML Processes:
The pipelines are built on Python applications, allowing for flexible integration of both ready-made ML libraries and custom logic. This approach significantly reduced the volume of manual operations in data processing and model handling.
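One way to mix ready-made library calls with custom logic in a single configurable pipeline is a stage registry. This is a hypothetical sketch; the registry, stage names, and config format are assumptions for illustration only.

```python
from typing import Callable, Dict, List

# Registry mapping stage names (as they appear in a pipeline config)
# to the functions that implement them.
STAGE_REGISTRY: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds a stage function to the registry."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        STAGE_REGISTRY[name] = fn
        return fn
    return wrap


@register("lowercase")  # thin wrapper over a built-in (a "library" step)
def lowercase(text: str) -> str:
    return text.lower()


@register("strip_noise")  # custom project-specific logic
def strip_noise(text: str) -> str:
    return " ".join(tok for tok in text.split() if tok.isalnum())


def run(config: List[str], text: str) -> str:
    """Apply the stages named in the config, in order."""
    for name in config:
        text = STAGE_REGISTRY[name](text)
    return text


cleaned = run(["lowercase", "strip_noise"], "OCR Output: Smith 1874 !!")
```

Because the pipeline is driven by a list of stage names, swapping a library-backed step for custom logic is a config change rather than a code change, which is where the reduction in manual operations comes from.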
3. Reduction of Time-to-Market:
To shorten implementation timelines, the pipelines were integrated with AWS services, including AWS SageMaker, which enabled dynamic scaling of compute resources.
Additionally, monitoring was set up to detect performance degradation and processing errors.
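Monitoring for processing errors can be sketched as a sliding-window error-rate check. The class name, window size, threshold, and alert condition below are illustrative assumptions, not the project's actual alerting configuration.

```python
from collections import deque


class ErrorRateMonitor:
    """Fires an alert when the failure rate over a sliding window
    of recent processing outcomes exceeds a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.events = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok: bool) -> bool:
        """Record one outcome; return True if an alert should fire."""
        self.events.append(ok)
        # Only alert once the window is full, to avoid noisy startup alerts.
        if len(self.events) < self.events.maxlen:
            return False
        failure_rate = self.events.count(False) / len(self.events)
        return failure_rate > self.threshold


monitor = ErrorRateMonitor(window=10, threshold=0.2)
# Seven successes followed by three failures: 30% failure rate.
alerts = [monitor.record(ok) for ok in [True] * 7 + [False] * 3]
```

In practice such a check would sit behind a metrics service (e.g. CloudWatch alarms feeding SNS), but the threshold-over-a-window logic is the same.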
Result
- Speed of implementation: deployment time for new models was reduced from weeks to hours.
- Resource savings: compute costs were optimized through automatic scaling in AWS.
- Reliability: fault tolerance when processing millions of images.
Thanks to the developed automated ML pipeline system, we also participated in the development of an ML solution for extracting data from historical handwritten texts.
Challenges
During the project, the team addressed the complexity of integrating custom ML models, an area in which no widely accepted standards exist.
Technologies
- AWS (SQS, SNS, EC2, Lambda, S3, SageMaker, ASG, etc.)
- Python
- Terraform
- Jenkins
- Docker
- Harness
- BentoML