PayPay has become the most popular QR code payment service in Japan. Launched on 5th October 2018, we have already passed 1 billion payments, and as you can imagine, it has been quite a journey to keep our systems growing on a par with the product's popularity.
Looking at the rate of new-user signups over the past few months, it's safe to say that PayPay is growing at a tremendous rate. Whoever said that startups grow like bamboo was not kidding at all. (Bamboo can grow as much as 3 feet in 24 hours.)
The first spurt in user growth came in December 2018, and with the help of subsequent campaigns and organic growth, we reached 15 million users within 12 months of launch. Now we're at 25+ million users. That's right: 25 million users in less than 1.5 years, or about 20% of Japan's population (126 million in 2018).
With the increasing number of users and merchants we support, our transaction rate per second has also been increasing steadily.
Compared to when we started back in 2018, our current average transactions per day is at least 50 times higher.
With all this growth, we realized we needed to scale our system before demand surpassed what it could handle.
Kickstarting the journey to scale our system
The first major focus for us was to support 1000 TPS, a goal driven by the user response to our 1st anniversary campaign on 5th October 2019. At that time we supported far fewer transactions per second, and over the course of a single day our transaction volume increased 4x.
This triggered a paradigm shift for our backend team: figure out our maximum supported TPS, and improve our design to support more than 1000 TPS.
The very first thing we needed was to see where the bottlenecks were and how much load our system could support.
To achieve this goal, we used our "performance" environment (an exact replica of our production system) to find our maximum supported TPS. Regular tests on the performance environment also helped us identify bottlenecks in our system.
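To make the idea of "maximum supported TPS" concrete, here is a toy closed-loop load generator in Python: a pool of workers fires requests as fast as it can, and we derive the achieved TPS plus latency percentiles. The `fake_payment` stub and all numbers are placeholders; our actual scenarios and tooling (built by the performance team) are far more elaborate.

```python
import concurrent.futures
import random
import statistics
import time

def fake_payment() -> float:
    """Stand-in for one payment API call; returns its latency in seconds."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.005))  # pretend network + DB work
    return time.perf_counter() - start

def run_load_test(workers: int = 8, requests_per_worker: int = 50) -> dict:
    """Fire workers * requests_per_worker requests; report TPS and percentiles."""
    t0 = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fake_payment)
                   for _ in range(workers * requests_per_worker)]
        latencies = [f.result() for f in futures]
    wall = time.perf_counter() - t0
    return {
        "tps": len(latencies) / wall,                                # achieved throughput
        "p50_ms": statistics.median(latencies) * 1000,               # median latency
        "p99_ms": sorted(latencies)[int(len(latencies) * 0.99)] * 1000,  # tail latency
    }

print(run_load_test())
```

In a real test the stub becomes an HTTP call against the performance environment, and you raise the worker count until TPS plateaus while the tail latency climbs; that plateau is your maximum supported TPS.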
A few days after we started our performance testing, we encountered the following challenges:
- The workload on our performance and production environments was not exactly the same
- The databases had similar specs, but the data volumes differed, since our performance environment had accumulated data from erratic large-volume tests
- Standard day-to-day operational metrics were not enough to drill deep into the application's bottlenecks
To achieve a workload similar to production, a dedicated team rebuilt our performance environment and designed test scenarios, along with tooling to test our system's performance. That effort deserves a separate tech blog of its own.
We then used the monitoring tools at our disposal to gain more clarity and to verify whether those data points were enough.
We added custom Datadog metrics and used New Relic's distributed tracing to locate bottlenecks across our microservice architecture. We also enabled AWS Performance Insights to visualize database performance.
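As an illustration of the custom-metric side, here is a minimal Python sketch that times a code path and ships the latency to a local DogStatsD agent over UDP, the transport Datadog's client library uses under the hood. The metric name, tags, and payment stub are hypothetical, not our real instrumentation.

```python
import socket
import time
from contextlib import contextmanager

# DogStatsD listens on UDP 8125 by default; datagrams are fire-and-forget.
DOGSTATSD_ADDR = ("127.0.0.1", 8125)

def format_histogram(name: str, value_ms: float, tags: list) -> str:
    # DogStatsD histogram datagram: <name>:<value>|h|#<tag1>,<tag2>
    return f"{name}:{value_ms:.3f}|h|#{','.join(tags)}"

@contextmanager
def timed(name: str, tags: list, sock: socket.socket):
    """Measure the wrapped block and emit its duration as a histogram metric."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        sock.sendto(format_histogram(name, elapsed_ms, tags).encode(), DOGSTATSD_ADDR)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
with timed("payment.db.transaction", ["service:payment", "op:insert"], sock):
    time.sleep(0.01)  # stand-in for a real DB insert
```

Wrapping the hot DB calls like this is what lets you see per-operation latency histograms in the dashboard, rather than only coarse host-level metrics.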
If you haven't read our intro to the PayPay Backend, feel free to go through it.
Since the time that blog was written, we have switched our managed database from RDS to Aurora.
Database as the biggest bottleneck
Our backend is based on a microservice architecture, where every component is distributed and horizontally scalable. Or so we thought. 😅 After multiple rounds of testing, the bottleneck turned out to be in our Payment System: the database workload for each payment was introducing huge latencies in our DB inserts and updates.
| | Count (average per txn) | Average latency (ms) |
| --- | --- | --- |

The average latency above is the latency per database transaction, as seen from the application side.
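To see why per-statement latency matters so much, a back-of-the-envelope calculation: if each payment issues several sequential DB statements, their combined latency bounds the throughput of a single connection. The numbers below are illustrative, not our measured values.

```python
# If one payment performs `ops` sequential DB statements averaging
# `latency_ms` each, the DB time per payment caps how many payments one
# connection can push through per second.

def max_tps_per_connection(ops: int, latency_ms: float) -> float:
    per_payment_ms = ops * latency_ms   # total DB time spent per payment
    return 1000 / per_payment_ms        # payments per second on one connection

# e.g. 10 statements at 5 ms each -> 50 ms of DB time per payment
print(max_tps_per_connection(10, 5.0))  # 20.0 TPS per connection
```

Connection pools multiply this number, but only until the database itself saturates, which is exactly what our testing showed.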
At the start of our testing, we were using an Aurora r4.8xlarge instance for our Payment Service. If you are not familiar with AWS instance types, r4.8xlarge is a beast of a machine with 32 vCPUs and 244 GB of memory.
With Performance Insights enabled on our Aurora instances, we started to see why our workload was hiccuping:
our Aurora instance was spending a lot of time waiting for the MySQL binlog flush to finish.
For disaster recovery and other requirements, we needed binlog replication enabled in our system. Even though binlog replication is semi-synchronous, our workload degraded to high and unstable latencies at around 600 TPS in our performance environment.
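For reference, these are the kinds of stock MySQL parameters that govern binlog flushing and semi-synchronous replication. This is an illustrative my.cnf-style fragment, not our production configuration; Aurora MySQL manages binlog storage itself and exposes only a subset of these knobs through cluster parameter groups.

```ini
# Illustrative stock-MySQL settings; not our actual values.
[mysqld]
binlog_format = ROW                  ; row-based binlog, required by most replica setups
sync_binlog = 1                      ; fsync the binlog on every commit (durable but slow)
rpl_semi_sync_master_enabled = 1     ; commit waits for at least one replica ack
rpl_semi_sync_master_timeout = 1000  ; ms to wait before falling back to async
```

The tension is visible right in the settings: durability (`sync_binlog = 1` plus a semi-sync ack) puts a flush-and-wait on the commit path of every transaction, which is exactly the wait event Performance Insights surfaced.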
We fine-tuned Aurora to use an instance type that provides better storage performance, which mitigated the flushing bottleneck. However, we still only achieved a maximum of 400 TPS. That is a 100% increase, but still 150% short of our 1000 TPS goal.
For the next big jump in our performance improvement, stay tuned for part 2 of this blog.
In part 2, we will cover our switch to a distributed database, and how we made sure the switch was smooth and robust.
- TPS: Transactions per Second
- Performance environment: an exact replica of our production system
- DB: Database
- vCPU: virtual CPU
PayPay engineers are working on a whole range of exciting challenges as we change the future of commerce in Japan. Are you interested in joining the team? Take a look at our recruitment site and get in touch.