Once we confirmed the event schedule, we began preparations. We identified the
primary bottleneck in the system: the database queries triggered by the frontend when loading the feed.
Our plan included the following steps:
- Analyze current database queries and identify the most resource-intensive ones (see the sketch after this list).
- Ask the development team to optimize those queries.
- Run load testing.
- Analyze test results to locate and address bottlenecks and scaling limitations.
- Given the tight timeframe, prioritize fixing only critical issues.
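For the first step, pulling the most expensive statements out of the database's own statistics is usually enough. The snippet below is a minimal sketch, assuming PostgreSQL 13+ with the pg_stat_statements extension enabled; the connection string, limit, and column choice are placeholders rather than our actual setup.

```typescript
// Sketch: list the most expensive queries from pg_stat_statements.
// Assumes PostgreSQL 13+ with pg_stat_statements enabled; DATABASE_URL is a placeholder.
import { Client } from "pg";

async function topQueries(limit = 10): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    const { rows } = await client.query(
      `SELECT query,
              calls,
              round(total_exec_time::numeric, 1) AS total_ms,
              round(mean_exec_time::numeric, 1)  AS mean_ms
         FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT $1`,
      [limit]
    );
    for (const row of rows) {
      console.log(`${row.mean_ms} ms avg, ${row.calls} calls: ${row.query}`);
    }
  } finally {
    await client.end();
  }
}

topQueries().catch(console.error);
```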
Analysis showed that navigating to the homepage generated a particularly
heavy and slow query. With just 100 concurrent users, this could lead to service degradation. This query was promptly handed off to the development team for refactoring and optimization.
Meanwhile, the DevOps engineers worked on the infrastructure side.
We
upgraded the production cluster’s node pool, switching to more powerful node types. This allowed for greater resource allocation per pod. Autoscaling (HPA + node autoscaling) was already in place and functioning, enabling the system to automatically scale during peak loads. Additionally, we integrated
Cloudflare proxying, which gave us a fast way to mitigate potential DDoS attacks.
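For reference, the autoscaling piece looked roughly like the sketch below: an autoscaling/v2 HorizontalPodAutoscaler scaling a deployment on CPU utilization, with cluster node autoscaling adding capacity when pods no longer fit. The deployment name, namespace, replica bounds, and CPU target are illustrative placeholders, not our actual values.

```typescript
// Sketch of an autoscaling/v2 HPA manifest, expressed as a typed object.
// Names, replica bounds, and the CPU target are illustrative placeholders.
const hpa = {
  apiVersion: "autoscaling/v2",
  kind: "HorizontalPodAutoscaler",
  metadata: { name: "feed-api", namespace: "production" },
  spec: {
    scaleTargetRef: {
      apiVersion: "apps/v1",
      kind: "Deployment",
      name: "feed-api",
    },
    minReplicas: 3,
    maxReplicas: 20,
    metrics: [
      {
        type: "Resource",
        resource: {
          name: "cpu",
          target: { type: "Utilization", averageUtilization: 70 },
        },
      },
    ],
  },
};

// kubectl accepts JSON as well as YAML, so one way to apply this is:
//   ts-node hpa.ts | kubectl apply -f -
console.log(JSON.stringify(hpa, null, 2));
```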
The final step was simulating the expected user flows with
K6 to generate the planned load. Based on the test results, we increased the available resources for the Kubernetes cluster and the database.
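A simplified version of such a scenario might look like the sketch below: ramp up to the expected concurrency, hold it, and enforce latency and error-rate thresholds. The endpoint, stage durations, and threshold values are placeholders rather than our actual test plan; recent K6 releases can run TypeScript directly, and the same code also works as plain JavaScript.

```typescript
// Sketch of a K6 scenario approximating the feed-loading flow.
// The URL, stages, and thresholds are illustrative placeholders.
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "2m", target: 100 }, // ramp up to the expected concurrency
    { duration: "5m", target: 100 }, // hold the planned load
    { duration: "1m", target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500"], // fail the run if p95 latency exceeds 500 ms
    http_req_failed: ["rate<0.01"],   // or if more than 1% of requests fail
  },
};

export default function () {
  const res = http.get("https://example.com/feed");
  check(res, { "status is 200": (r) => r.status === 200 });
  sleep(1);
}
```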