Preparing Stars Arena’s Infrastructure for a Major User Surge

Our client, Stars Arena, is a decentralized social platform (SocialFi) built on the Avalanche blockchain. It allows users to monetize their influence and interact with audiences through the purchase and sale of “tickets” — tokenized shares of user profiles. These tickets grant access to exclusive content such as private chats and posts.

Challenge:

Prepare the project infrastructure for a major product release and anticipated user surge. The target load was 2,000 RPS (requests per second), with a deadline of 2 weeks. The release was heavily promoted - it was a large-scale marketing event that had been planned well in advance and backed by significant investment, making any delays unacceptable. Our team was also responsible for monitoring the system and being ready to provide technical support in case anything went wrong.

Results:

Working closely with the development team, we increased the supported load from 100 RPS to 300 RPS within the first week, and to 2,000 RPS by the end of the second. When the event went live, traffic briefly surpassed the projected peaks, but thanks to reserved capacity and thorough joint preparation by the DevOps and development teams, the launch proceeded without any manual intervention. The only action required was passive monitoring of load graphs and alerts.

Timeline:

Q4 2024

Team:

Frontend developer, Backend developer, 2 DevOps engineers

Stack and Tools:

  • Google Kubernetes Engine
  • K6
  • Cloudflare proxy
  • GCP Load Balancer
  • GCP Cloud Armor
  • GCP Logging

Approach:

Once we confirmed the event schedule, we began preparations. We identified the primary bottleneck in the system - database queries triggered by the frontend when loading the feed.

Our plan included the following steps:
  1. Analyze current database queries and identify the most resource-intensive ones.
  2. Ask the development team to optimize those queries.
  3. Run load tests.
  4. Analyze test results to locate and address bottlenecks and scaling limitations.
  5. Given the tight timeframe, prioritize fixing only critical issues.

Analysis showed that navigating to the homepage generated a particularly heavy and slow query. With just 100 concurrent users, this could lead to service degradation. This query was promptly handed off to the development team for refactoring and optimization.

Meanwhile, the DevOps engineers worked on the infrastructure side.

We upgraded the production cluster’s node pool, switching to more powerful node types. This allowed for greater resource allocation per pod. Autoscaling (HPA + node autoscaling) was already in place and functioning, enabling the system to automatically scale during peak loads. Additionally, we integrated Cloudflare proxying, which gave us rapid mitigation against potential DDoS attacks.
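
To make the pod-level autoscaling concrete, here is a hedged sketch of what such an HPA policy can look like, written with Pulumi’s Kubernetes provider in TypeScript. Pulumi is used purely for illustration — it is not part of the stack listed above — and every name and number in the snippet is a placeholder rather than a production value.

```typescript
// hpa-example.ts — an illustrative HorizontalPodAutoscaler expressed with Pulumi's
// Kubernetes provider. Pulumi is not part of the stack listed above; it is used here
// only to show the shape of the policy. Names and replica bounds are placeholders.
import * as k8s from "@pulumi/kubernetes";

const apiHpa = new k8s.autoscaling.v2.HorizontalPodAutoscaler("api-hpa", {
  metadata: { namespace: "production" },          // placeholder namespace
  spec: {
    scaleTargetRef: {
      apiVersion: "apps/v1",
      kind: "Deployment",
      name: "api",                                // placeholder workload name
    },
    minReplicas: 4,                               // baseline capacity kept in reserve
    maxReplicas: 40,                              // ceiling reached only at peak traffic
    metrics: [
      {
        type: "Resource",
        resource: {
          name: "cpu",
          target: { type: "Utilization", averageUtilization: 70 }, // scale out above ~70% CPU
        },
      },
    ],
  },
});

export const hpaName = apiHpa.metadata.name;
```

Node-level autoscaling in GKE is configured on the node pool rather than in the manifest, so the HPA only has to describe pod-level scaling: the cluster adds pods first, and new nodes are provisioned as soon as pending pods no longer fit.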

The final step was to simulate expected user flows with K6, generating the planned load. Based on the test results, we increased the resources available to the Kubernetes cluster and the database.
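
As an illustration, below is a minimal k6 scenario sketch in TypeScript (recent k6 releases can run TypeScript directly). The host, endpoint, stage durations, and thresholds are placeholders, not the actual scripts used for Stars Arena.

```typescript
// feed-load-test.ts — a minimal k6 sketch, not the actual Stars Arena test plan.
// The host, endpoint, stage durations, and thresholds are illustrative placeholders.
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    feed_browsing: {
      executor: 'ramping-arrival-rate', // schedules iterations by rate, i.e. requests per second here
      startRate: 100,                   // roughly the original ~100 RPS baseline
      timeUnit: '1s',
      preAllocatedVUs: 500,             // VUs reserved up front to sustain the rate
      maxVUs: 4000,
      stages: [
        { target: 300, duration: '5m' },   // first-week milestone
        { target: 2000, duration: '10m' }, // ramp toward the target peak
        { target: 2000, duration: '15m' }, // hold at 2,000 RPS
      ],
    },
  },
  thresholds: {
    http_req_failed: ['rate<0.01'],    // keep errors under 1% (illustrative)
    http_req_duration: ['p(95)<800'],  // p95 latency budget in ms (illustrative)
  },
};

export default function () {
  // One request per iteration: the feed endpoint, since feed queries were the identified bottleneck.
  const res = http.get('https://app.example.com/api/feed');
  check(res, { 'feed returned 200': (r) => r.status === 200 });
}
```

An arrival-rate executor is used in the sketch because it drives request rate directly, which maps onto an RPS target such as 2,000 more naturally than a fixed virtual-user count; other user flows (profiles, chats, ticket purchases) could be modelled as additional scenarios.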

Outcome:

  • Developer-side optimizations were confirmed to be effective.
  • The infrastructure handled the target load with headroom to spare.
  • Autoscaling performed as expected.

During the actual event, user numbers slightly exceeded our expectations. However, the additional nodes provisioned by autoscaling handled the traffic flawlessly. The development and DevOps teams only observed metrics and alerts - the system functioned without any manual intervention.
