Pure Brew engineers have a proven track record of working with Scala and its ecosystem for the past 10 years. They have experience developing applications with technologies such as Akka Cluster, Sharding, and Persistence, running these services in production, and making minor contributions to open-source projects. Because of this expertise, a client reached out for assistance with their existing system.
The client's company has a large-scale, distributed system built on the Akka framework. The system processes a high volume of requests and events, and the client had observed significant performance issues under heavy load; at times, nodes in the cluster became completely unresponsive.
The objective was to fine-tune the Akka configuration to improve the performance, scalability, and reliability of the service. An important part of the review was identifying the bottlenecks in the system. The main goals were to reduce latency, increase throughput, and ensure the system could handle growing workloads, all while delivering measurable improvements.
Team and Project Setup
The project started with a 2-day workshop with the client's engineers, who provided an overview of the system, including its architecture, components, and key challenges. This gave us a good understanding of the system and allowed us to identify potential areas for improvement. After the workshop, we prepared a plan for the optimization project, which included a review of the source code, JVM profiling, instrumentation using Kamon, and performance testing. We requested access to the source code and received it promptly, which allowed us to start the review. We then conducted a thorough code review, covering the Akka system configuration and actor hierarchy as well as the JVM configuration. We also ran performance tests both locally and against the client's environment to understand the system's behavior under different workloads.
The project was carried out fully remotely. Despite this, we were able to work closely with the client's engineers and gain a deep understanding of the system through the initial engagement and review process.
Areas of Focus
We concentrated on four areas: instrumentation and monitoring, Akka configuration and performance testing, JVM profiling, and a code review of the actor system. By focusing on these areas, the team gained valuable insights into the system's behavior and performance and was able to verify the proposed improvements before deploying them to the production system.
To gain visibility into the system's performance, we added instrumentation using Kamon. The client was already using Datadog to monitor their production systems. Kamon is an open-source monitoring and tracing library for Akka and Akka HTTP-based applications. It tracks performance metrics such as actor and message counts, message processing time, and actor lifecycle events. We used Kamon to instrument the Akka actor system, which allowed us to collect detailed metrics on the system's actors, messages, and thread usage. In addition to the library itself, we used Kamon APM (kamon.io), a cloud-based monitoring and observability platform that adds alerting, anomaly detection, and visualization. By sending the metrics collected by Kamon to Kamon APM, we gained a comprehensive view of the system's performance over time, which helped us run the performance tests and aggregate the metrics. With these insights, we were able to optimize the Akka actor system for performance and scalability. For example, we identified which actors were handling the most messages and which were experiencing the most contention, which allowed us to optimize the actor hierarchy and improve performance.
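To give a flavor of what this setup involves, enabling Kamon's Akka instrumentation is largely a matter of adding the Kamon dependencies and agent plus a small amount of configuration. A minimal application.conf sketch is shown below; the service name, API key, and actor paths are placeholders, and the exact keys should be checked against the Kamon version in use:

```
# Minimal Kamon configuration sketch (keys as in Kamon 2.x; verify for your version)
kamon {
  environment {
    service = "orders-service"      # hypothetical service name
  }

  # Report metrics to Kamon APM; the API key below is a placeholder
  apm.api-key = "xxxx-xxxx"

  # Only track actor metrics for the actors we care about
  instrumentation.akka.filters {
    actors.track {
      includes = [ "orders-service/user/**" ]
      excludes = [ ]
    }
  }
}
```

The include/exclude filters matter in practice: tracking every actor in a large system can itself add measurable overhead, so it pays to narrow tracking to the hierarchies under investigation.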
Configuration & Testing
It was tricky to simulate the production load, so we first established a baseline of the system's local throughput without any changes to the configuration or code. We then validated the proposed changes by running the performance tests and observing the system's behavior under different Akka and JVM configurations. By comparing the test results across configurations, we determined the optimal configuration for the system. We fine-tuned the Akka configuration accordingly: we increased the number of threads in the dispatcher's thread pool, which allowed for better parallelism and reduced contention, and we increased the number of actors in the system, which allowed for better distribution of work. We ran performance tests both locally and against a cloud environment to identify any issues, using Gatling to simulate a high volume of requests and events.
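The dispatcher changes described above are plain Akka configuration. A sketch of the kind of tuning involved is shown below; the dispatcher name and all numbers are illustrative, not the client's actual values:

```
# Custom dispatcher with a larger fork-join pool (values are illustrative)
my-app.worker-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 8        # lower bound on thread count
    parallelism-factor = 3.0   # threads = cores * factor, clamped to min/max
    parallelism-max = 64       # upper bound on thread count
  }
  # Messages an actor processes before yielding its thread:
  # higher values improve batching, lower values improve fairness
  throughput = 10
}
```

Actors are then assigned to it via `props.withDispatcher("my-app.worker-dispatcher")`, which also isolates heavy workloads from the default dispatcher so they cannot starve the rest of the system.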
Profiling the JVM
To identify and resolve performance issues, we profiled the JVM using jvisualvm to gain insights into its behavior. We profiled heap and stack memory usage and tracked the number of objects and classes on the heap, which allowed us to identify memory leaks and areas of high memory usage that could be causing performance issues. We also used jvisualvm to monitor garbage collection, which let us spot potential issues with how the JVM was managing memory and adjust the JVM configuration and parameters accordingly. We tested different JVM configurations, such as increasing the heap size and fine-tuning the GC settings. We also compared garbage collectors such as G1, CMS, Parallel, and Shenandoah and found that they behave differently under different scenarios; for example, G1 performed better under heavy load. Finally, we tested the system on newer Java versions, which ship with updated GC algorithms that affected the system's performance.
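For reference, the kind of JVM options we experimented with looks like the sketch below. The heap size and pause target are illustrative; the right values depend entirely on the workload and should be confirmed with load tests:

```
# Illustrative JVM options for a G1-based configuration:
# - Xms/Xmx set equal to avoid heap-resize pauses (4g is a placeholder)
# - MaxGCPauseMillis is the pause target G1 adapts toward
# - Xlog:gc* enables unified GC logging (JDK 9+ syntax)
java -Xms4g -Xmx4g \
     -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=200 \
     -Xlog:gc*:file=gc.log \
     -jar service.jar
```

Keeping GC logs enabled during the test runs is what makes configurations comparable: the logs can be diffed across runs rather than relying on end-to-end latency numbers alone.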
As part of optimizing the system, we conducted a thorough code review of the service, with a focus on the Akka system configuration and actor hierarchy. We reviewed the configuration to ensure it was optimized for performance and scalability, looking for issues with thread pool sizing, actor creation, and message passing that could be causing bottlenecks. We reviewed the actor hierarchy to ensure it supported the desired level of concurrency and scalability, examining how actors were created, managed, and destroyed, and how messages were passed between them. We also evaluated whether the number of actors was optimal and whether each actor was handling an appropriate amount of work. Finally, we reviewed the code for unnecessary complexity and duplication, which can hurt performance, and refactored the problematic code to improve the performance, scalability, and maintainability of the system. This review allowed us to identify and resolve issues in the Akka system configuration and actor hierarchy, which resulted in significant performance gains.
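One concrete lever that comes out of this kind of actor-hierarchy review is routing: spreading work across a pool of workers instead of funneling it through a single actor. In classic Akka this can be expressed purely in configuration; a hedged sketch follows, where the actor path and pool size are illustrative:

```
# Deploy /user/worker as a round-robin pool rather than a single actor
akka.actor.deployment {
  /worker {
    router = round-robin-pool   # rotate incoming messages across pool members
    nr-of-instances = 16        # pool size is illustrative; tune via load tests
  }
}
```

The actor is then created with `FromConfig.props(...)` so the pool size can be changed without touching code, which fits the test-and-compare workflow described above: the same build can be benchmarked under several pool sizes by editing configuration alone.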
After implementing the optimizations, we observed a significant improvement in the system's performance and scalability. The system was able to handle a higher volume of requests and events without performance degradation, and we resolved the main performance issue in the production system, which was the primary objective. This led to a closer collaboration with the client for an additional six months, during which the team provided consultations on various parts of the system.