Data Warehouse (DWH) Design for a Large IT Company
We created an analytical environment to automate the collection, processing, and visualization of information. This included designing a Data Warehouse (DWH), configuring ETL processes in Airflow, and developing dashboards in Superset.
Customer
Large IT company.
Task
To automate internal work processes, the client needed an analytical environment: a set of software tools that would simplify the collection, processing, and visualization of data.
Business Goals
- Obtain a tool for quick data collection and analysis.
- Create a single reliable source of data for all users.
- Train users to work with data analytics systems.
Technical Goals
- Design a scalable DWH architecture.
- Automate the loading and transformation of data from various sources.
- Ensure data security through a role‑based access model.
- Optimize query performance for large data volumes.
Solution
1. Analysis and Design
In three weeks, we analyzed how other companies on the market structure their infrastructure, identified the IT products the system would need, and surveyed the market for the most effective solutions. Based on the collected data, we prepared a technical specification for DevOps engineers to deploy the required infrastructure.
We selected the optimal solutions: a database management system (DBMS), ETL tools, BI systems, and additional software products. We chose Greenplum for storing historical data and ClickHouse for real‑time analytics.
ClickHouse is currently one of the fastest analytical systems, reducing query execution time severalfold.
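To give a sense of how the real‑time layer is queried, here is a minimal sketch using the clickhouse-driver Python package; the host, credentials, table, and column names are hypothetical placeholders rather than the client's actual schema.

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

# Hypothetical connection details -- replace with real host and credentials.
client = Client(host="clickhouse.internal", user="analyst", password="***")

# Aggregate events per day; `events` and its columns are illustrative only.
rows = client.execute(
    """
    SELECT toDate(event_time) AS day, count() AS events
    FROM events
    WHERE event_time >= now() - INTERVAL 7 DAY
    GROUP BY day
    ORDER BY day
    """
)
for day, events in rows:
    print(day, events)
```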
2. ETL Process Development
We developed ETL (Extract, Transform, Load) processes to extract data from various sources, transform it into a format suitable for analytics, and load it into the target system.
Over the next six weeks, we created more than 15 DAGs in Airflow. DAGs, or Directed Acyclic Graphs, define the logic and sequence of extracting, transforming, and loading data. For complex transformations, we implemented data processing using Python (Pandas) and set up monitoring and failure notifications.
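Below is a minimal sketch of such a DAG, assuming Airflow 2.x, with a Pandas transformation and a failure callback; the file paths, table names, and notification logic are illustrative assumptions, not the project's actual pipeline code.

```python
from datetime import datetime, timedelta

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Hypothetical alert hook -- a real project could post to email or chat.
    print(f"Task {context['task_instance'].task_id} failed")


def extract_transform_load():
    # Illustrative file-based source; real sources could be APIs or databases.
    df = pd.read_csv("/data/raw/sales.csv")
    df["amount"] = df["amount"].fillna(0)
    daily = df.groupby("sale_date", as_index=False)["amount"].sum()
    daily.to_csv("/data/marts/daily_sales.csv", index=False)


with DAG(
    dag_id="daily_sales_mart",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": notify_on_failure,
    },
) as dag:
    etl = PythonOperator(
        task_id="extract_transform_load",
        python_callable=extract_transform_load,
    )
```

Attaching the callback through `default_args` keeps monitoring consistent: every task in the DAG triggers the same notification path on failure.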
Automation through ETL eliminated the errors previously introduced by manual data processing.
3. Visualization
Over the following four weeks, we developed and refined dashboards for key metrics in Superset.
4. Implementation
In two weeks, we conducted four training webinars for more than 20 employees and set up role‑based access to the system.
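As a sketch of what the role‑based model can look like at the database level, the following grants read‑only access to a data mart schema in Greenplum (which speaks the PostgreSQL protocol) via psycopg2; the role, schema, and connection details are assumptions made for the example.

```python
import psycopg2  # pip install psycopg2-binary

# Hypothetical connection -- Greenplum is accessed over the PostgreSQL protocol.
conn = psycopg2.connect(
    host="greenplum.internal", dbname="dwh", user="admin", password="***"
)
conn.autocommit = True

with conn.cursor() as cur:
    # Read-only role for dashboard users; role and schema names are illustrative.
    cur.execute("CREATE ROLE bi_reader NOLOGIN")
    cur.execute("GRANT USAGE ON SCHEMA marts TO bi_reader")
    cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA marts TO bi_reader")

conn.close()
```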
Result
The client received a ready‑to‑use analytical system designed to work with various data formats and sources.
Services provided:
- Gathering requirements for the analytics platform
- Designing the DWH architecture (Greenplum, ClickHouse)
- Developing ETL pipelines in Airflow and writing DAGs in Python
- Training the team to work with the new tools
- Implementing version control with GitLab
- Developing interactive dashboards in Superset
- Integrating with external APIs
- Creating data marts and optimizing SQL queries
The SimbirSoft project team included:
- 6 Data Analysts
- 2 System Analysts
- 2 DWH Analysts
- 1 DevOps Engineer
Technologies
- Databases: Greenplum, ClickHouse
- ETL: Airflow (Python), Pandas
- Visualization: Superset
- Infrastructure: Docker, k9s
- Version control: GitLab
- IDE: PyCharm, DBeaver