Rescuing an Advertising Campaign Platform from Critical Timeouts

When a business relies on data processing speed, every delay in service can result in lost revenue. Our client, a platform for launching advertising campaigns and analyzing their effectiveness, faced just such a problem. 

Sudden timeouts, server crashes, and an overloaded database made it impossible for users to work. The situation required urgent intervention, but the cause of the failure was not obvious. We will share how we identified the bottleneck in the system, restored its functionality, and laid the groundwork for further optimization.

This case study will be useful for business owners planning to scale their services and for IT directors making decisions about project architecture. It shows how to identify architectural bottlenecks and plan for the service's growth as early as the design phase.

Customer

A platform for launching advertising campaigns and analyzing their effectiveness.

Task

In paid digital marketing, it’s important not only to launch ads but also to adjust the strategy in a timely manner if it’s not yielding results. The longer problems go unnoticed, the more budget is wasted. To avoid this, marketers use analytics services that help track campaign effectiveness in real-time and make data-driven decisions. 

One such service approached us at SimbirSoft with a serious problem: their platform had started to noticeably slow down. At that time, the product was about four years old, and initially, it operated without issues. However, with the growth in user numbers and increasing data volumes, alarming signals began to emerge. 

Users complained that their personal accounts were loading slowly, reports were taking too long to generate, and sometimes the process would abruptly end with Timeout, 500, 502, or 503 errors. In critical moments, the service could fail to load part of the data, forcing advertisers to wait too long for system responses. This meant they couldn’t quickly assess the effectiveness of their campaigns, disable non-performing ads in time, or reallocate budgets. Essentially, the money spent on advertising was simply going up in smoke.

The problem needed to be resolved quickly; otherwise, the service risked losing client trust and revenue. Such a service must operate 24/7; otherwise, everyone involved suffers: the business owners, the platform's users, and the end customers. In particular, if an advertising campaign runs incorrectly, end customers may receive irrelevant ads or see them too frequently.

2.5 Months
Project Duration
100%
Reduction in User Complaints About the Service

How We Discovered the Problem

The first signals of a problem came from users: they complained about the personal account freezing and system malfunctions. Additional alarming indicators appeared on the “Server Load” dashboard, and resource consumption noticeably increased. 

The backend development team began a detailed analysis using monitoring tools: Grafana for request-processing metrics (response time, service error rates, and the number of requests at a given moment), Jaeger for tracing request paths through the distributed system to pinpoint bottlenecks and delays, and Zabbix for monitoring network and hardware resources.
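The case study does not specify how these metrics were exported, so the snippet below is only a minimal sketch of the idea: a backend handler publishing the request metrics named above (latency, errors, request count) through the Python prometheus_client library so that Grafana can chart them. The /reports endpoint and the simulated failures are assumptions, not part of the client's codebase.

```python
# A sketch, not the client's code: expose latency, error, and request-count
# metrics so that Prometheus can scrape them and Grafana can chart them.
# The /reports endpoint and the simulated failures are hypothetical.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Time spent processing a request",
    ["endpoint"],
)
REQUEST_ERRORS = Counter(
    "http_request_errors_total",
    "Requests that ended in a timeout or 5xx error",
    ["endpoint"],
)
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total number of requests received",
    ["endpoint"],
)


def handle_report_request() -> None:
    """Hypothetical handler for a 'generate report' endpoint."""
    REQUEST_COUNT.labels(endpoint="/reports").inc()
    start = time.perf_counter()
    try:
        time.sleep(random.uniform(0.05, 0.3))  # stand-in for the real work
        if random.random() < 0.1:              # simulate an occasional failure
            raise TimeoutError("upstream service did not respond")
    except TimeoutError:
        REQUEST_ERRORS.labels(endpoint="/reports").inc()
    finally:
        REQUEST_LATENCY.labels(endpoint="/reports").observe(time.perf_counter() - start)


if __name__ == "__main__":
    start_http_server(8000)  # metrics appear at http://localhost:8000/metrics
    while True:
        handle_report_request()
```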

The problem was not immediately apparent; we needed to dig into the chain of service calls and analyze which service was under heavy load and why.

The first thing that stood out was the instability in request processing. Timeouts occurred on average once every 4–10 requests, and during peak times, the system operated erratically. Interestingly, there were fewer timeouts recorded on weekends, but the effect was unstable: the service operated across different time zones, and there was simply no classic "low-load time."

By analyzing metrics and tracing requests, we identified a bottleneck — a cascading slowdown of services. It was initially unclear what exactly was overloading the system: resources were being consumed too quickly, and timeouts appeared without any obvious pattern. It later became evident that one of the services in the architecture was impacting the entire system. It overloaded the main service, causing complex analytical queries to take longer than usual to execute.

Why the Problem Occurred

This problem remained unnoticed for a long time because, at the start, the client had little data in the database, and everything was functioning smoothly. However, over time, the load increased, and the system could no longer handle it as easily as before.

Upon investigating the bottleneck, we found that the root cause was an architectural oversight: the service-oriented architecture (SOA) had been designed without sufficient expertise in distributed systems, which eventually led to a cascading slowdown. When one of the services became overloaded, its delays accumulated and rippled through the performance of the entire platform. This underscores how crucial an architect's expertise is at the design stage: without it, a project of this kind should not proceed.

In addition to the architectural issues, the database itself was suboptimal: it lacked proper indexes and used inefficient data structures, which caused complex analytical queries to take longer than they should have.

Such situations are rare: usually, system slowdowns develop gradually, giving the team time to scale. In this case, however, the failures appeared suddenly, without any gradual decline, which made the project quite atypical.

Solution

The problem needed to be addressed quickly and methodically. The client expected an implementation timeline of one month; in the end it took 2.5 months, because of the time spent analyzing the situation and onboarding a DevOps engineer. However, investing in that specialist's onboarding was a key factor in the project's success.

Resolving the issue required a phased approach to minimize risks and avoid additional failures. We worked as follows:

1. We deployed a replicated database that received real-time copies of data from the main service. This allowed us to test changes without affecting the operational system. 

2. We set up a replica of the main service on a new node, followed by a copy of the problematic service. This created an isolated environment where we could safely test hypotheses and find the optimal way to distribute the load. The replicas of both applications were deployed separately, with NGINX in front of them. 

3. We gradually redirected a portion of the traffic (10–20% of users) to the new application replica. This was necessary to observe how the system would behave under load and to ensure that the changes were indeed improving the situation (the routing idea is illustrated in the sketch after this list). 

4. Once testing confirmed the solution's effectiveness, we fully transitioned the traffic to the new isolated replicas. This allowed us to relieve the load on the main service and stabilize the system's operation without abrupt switches and risks for users.
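In the project itself the traffic split was handled at the NGINX level. As an illustration of the idea behind steps 3 and 4 only, here is a hedged Python sketch of deterministic per-user routing, where a fixed share of users is consistently sent to the new replica; the hostnames and the exact 20% share are assumptions.

```python
# Illustration only: a stable hash of the user id sends a fixed share of
# users (20% here) to the new replica, so each user always hits the same
# backend. Hostnames and the exact share are hypothetical; in the project
# the split itself was done by NGINX.
import hashlib

MAIN_SERVICE = "https://app.example.internal"         # hypothetical
NEW_REPLICA = "https://app-canary.example.internal"   # hypothetical
CANARY_SHARE = 20  # percent of users routed to the isolated replica


def pick_backend(user_id: str, canary_share: int = CANARY_SHARE) -> str:
    """Return the backend that should serve this user."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0-99
    return NEW_REPLICA if bucket < canary_share else MAIN_SERVICE


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3", "user-42"):
        print(uid, "->", pick_backend(uid))
```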

During the discussion, it was initially suggested to use Kubernetes — a tool that automatically distributes load, monitors the status of services, and simplifies the deployment of updates. It could have significantly streamlined the updating of system components and management of replicas.

However, at that time, the services were operating without Kubernetes, and its implementation would require a serious restructuring of the infrastructure, which would take too much time. Due to the urgency of the problem, we had to abandon this option.

Result

To enable replication, we first set up isolated environments and configured the databases for load distribution. About a month was spent onboarding a DevOps engineer, setting up monitoring tools, and identifying issues. An additional 1.5 months went into implementation, followed by optimizing the performance of the database and the system as a whole. 

After redirecting traffic to the isolated replicas, we conducted load testing on all requests from the problematic service. This helped identify discrepancies between the declared and actual data processing times. At this stage, we began to optimize specific problem areas. 

In some cases, the solution was straightforward — adding indexes, modifying the database schema (for example, reducing normalization or changing the field type). In other cases, it required implementing caching for frequently used data to reduce the load on the database. In the most complex instances, we had to completely rewrite the business logic to eliminate inefficient queries.
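The case study does not disclose the real schema, so the following sketch only illustrates the kind of fixes listed above, assuming a PostgreSQL database accessed through psycopg2: adding a missing index and caching the results of frequently repeated analytical queries. All table, column, and connection names are invented.

```python
# A sketch of the "simple" fixes, assuming PostgreSQL accessed through
# psycopg2; the table, columns, and connection string are invented.
import time

import psycopg2

DSN = "dbname=ads user=app password=secret host=db.internal"  # hypothetical


def add_missing_index() -> None:
    """Add a covering index for a frequent analytical filter."""
    conn = psycopg2.connect(DSN)
    conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run in a transaction
    with conn.cursor() as cur:
        cur.execute(
            """
            CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_campaign_stats_campaign_day
                ON campaign_stats (campaign_id, stat_date)
            """
        )
    conn.close()


# A minimal in-process cache with a TTL, used to keep repeated identical
# analytical queries from hitting the database every time.
CACHE_TTL_SECONDS = 60
_cache = {}  # maps (sql, params) -> (timestamp, rows)


def cached_query(conn, sql: str, params: tuple):
    key = (sql, params)
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]  # fresh enough, skip the database
    with conn.cursor() as cur:
        cur.execute(sql, params)
        rows = cur.fetchall()
    _cache[key] = (now, rows)
    return rows
```

Building the index with CONCURRENTLY keeps the table writable while the index is created, which matters when a fix has to be applied to a live system.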

We paid special attention to transactions. We reviewed areas where locking was used and either removed it if it was unnecessary or replaced it with Optimistic Concurrency Control (OCC) to avoid unnecessary delays in the system.
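As an illustration of the optimistic concurrency control pattern mentioned above, here is a hedged sketch using a version column: the update succeeds only if the row has not changed since it was read, and the caller retries otherwise. The table and columns are hypothetical, and the database is again assumed to be PostgreSQL.

```python
# A sketch of optimistic concurrency control with a version column, again
# assuming PostgreSQL/psycopg2; the campaign_budgets table is hypothetical.
import psycopg2

DSN = "dbname=ads user=app password=secret host=db.internal"  # hypothetical


class ConcurrentUpdateError(RuntimeError):
    """Raised when another writer keeps winning the race."""


def update_budget(campaign_id: int, new_budget: float, retries: int = 3) -> None:
    conn = psycopg2.connect(DSN)
    try:
        for _ in range(retries):
            with conn, conn.cursor() as cur:  # each attempt is one short transaction
                cur.execute(
                    "SELECT version FROM campaign_budgets WHERE id = %s",
                    (campaign_id,),
                )
                row = cur.fetchone()
                if row is None:
                    raise LookupError(f"campaign {campaign_id} not found")
                (version,) = row
                cur.execute(
                    """
                    UPDATE campaign_budgets
                       SET budget = %s, version = version + 1
                     WHERE id = %s AND version = %s
                    """,
                    (new_budget, campaign_id, version),
                )
                if cur.rowcount == 1:
                    return  # nobody changed the row between the read and the write
            # another writer bumped the version first; re-read and try again
        raise ConcurrentUpdateError(f"gave up after {retries} attempts")
    finally:
        conn.close()
```

With this pattern, a conflict shows up as a cheap retry instead of a held lock, so other requests are not forced to wait.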

Plans for Development

We continued our work on optimizing the storage by adding data segmentation to improve performance. 
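"Data segmentation" can be implemented in different ways; assuming it refers to range partitioning of a large statistics table by date (for example, PostgreSQL declarative partitioning), a minimal sketch might look like the following. The schema is invented for illustration.

```python
# One possible reading of "data segmentation": monthly range partitions for
# a large statistics table (PostgreSQL declarative partitioning). The schema
# is invented for illustration.
import psycopg2

DSN = "dbname=ads user=app password=secret host=db.internal"  # hypothetical

PARTITION_DDL = """
CREATE TABLE IF NOT EXISTS campaign_stats_part (
    campaign_id bigint         NOT NULL,
    stat_date   date           NOT NULL,
    impressions bigint         NOT NULL DEFAULT 0,
    clicks      bigint         NOT NULL DEFAULT 0,
    spend       numeric(12, 2) NOT NULL DEFAULT 0
) PARTITION BY RANGE (stat_date);

CREATE TABLE IF NOT EXISTS campaign_stats_2024_01
    PARTITION OF campaign_stats_part
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE IF NOT EXISTS campaign_stats_2024_02
    PARTITION OF campaign_stats_part
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
"""


def create_partitions() -> None:
    # Queries that filter on stat_date now scan only the matching partition.
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(PARTITION_DDL)


if __name__ == "__main__":
    create_partitions()
```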

After the changes were made, the service became more stable, and user complaints about errors and slow performance ceased. However, this was only a temporary solution that allowed us to quickly restore system functionality. 

Our plans include a comprehensive audit of the architecture and its redesign. One of the key steps will be to separate critical sections into individual services that can ensure uninterrupted operation and reduce the risks of cascading failures. Currently, the architecture remains unchanged, except for temporary replication, but significant work on redesign is ahead.

Key Tasks for the Near Future

  • Refactoring the problematic service that overloaded the main service.

  • Optimizing code and stored procedures in the database.

  • Refactoring the general service to enhance its stability and performance.

We will pay special attention to implementing Rate Limiting—a mechanism to restrict the number of requests per second from a single service. This will help prevent system overload in the future and avoid a recurrence of similar issues. Combined with database optimization, these measures will ensure the stability of the service.
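The enforcement point for rate limiting is not specified in this case study, so the following is only a hedged sketch of the underlying idea: a token-bucket limiter that caps how many requests per second a single calling service may issue. The service name and the limits are assumptions.

```python
# A sketch of the planned rate limiting: a token bucket that caps how many
# requests per second a single calling service may issue. The caller name
# and the limits are assumptions; the enforcement point is not decided yet.
import threading
import time


class TokenBucket:
    def __init__(self, rate_per_second: float, burst: int) -> None:
        self.rate = rate_per_second   # how fast tokens are refilled
        self.capacity = burst         # maximum tokens that can accumulate
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        """Return True if the caller may proceed, False if it is throttled."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False


# One bucket per calling service: at most 50 requests/second, bursts up to 100.
limiters = {"reporting-service": TokenBucket(rate_per_second=50, burst=100)}


def handle_request(caller: str) -> int:
    bucket = limiters.get(caller)
    if bucket is not None and not bucket.allow():
        return 429  # Too Many Requests: the caller is over its quota
    return 200      # process the request as usual
```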

If you have encountered the problems described in this case study, or similar ones, and want to avoid them at the start of development, read the checklist from our experts.
