Since being jointly established by SoftBank, Yahoo! Japan, and Paytm in 2018, PayPay has become the most widely used QR code payment service in Japan. This success has depended on achieving a unique combination of extreme scale, speed, and reliability.
- Scale: PayPay now has almost 30 million registered users, about one fourth of Japan’s population. In one of our core OLTP databases, one of the key tables used for payment processing holds more than 6 billion rows.
- Speed: We launched our first product to users in just three months and, in the two years since, have released more than 100 updates to PayPay products, which are powered by almost 200 different microservices deployed in Kubernetes clusters.
- Reliability: Our mobile apps are consistently rated more than 4.5 stars in both the Android and iOS app stores. In a culture where extreme quality is expected, even one second of downtime is unacceptable.
A key factor that has enabled us to devise creative solutions is our highly multicultural team, and the best-in-class tools and practices each culture has brought to the table.
Below, I describe the unique traits of several of our engineering cultures and how we successfully melded them into a high-performance engineering organization. While I only have space to focus on a few cultures, I want to emphasize that many others have contributed to our success. Finally, I close with a real solution we developed, our new deployment process, which exemplifies how our diverse team collaborates.
The defining traits of the Japanese engineering culture are an emphasis on quality and a strong sense of ownership and accountability. These traits have shaped our customers’ perception of PayPay as a reliable payment service.
Internet companies in Japan are held to a very high quality standard, which stems from the culture of kaizen (continuous improvement) closely associated with many Japanese industries. For example, for Internet services in most other countries, one or two minutes of system downtime might be entirely acceptable, or even worth bragging about if the system recovers automatically. For our Japan-based users and engineers, however, even this is unacceptable. Their very high standard has pushed us to work in a completely different style, innovating to ensure we make zero mistakes while still delivering fast.
As PayPay matures into an established company, we need to earn user trust with our reliability.
We aspire to make PayPay as reliable as cash, which makes the Japanese emphasis on quality and continuous improvement all-important.
When PayPay was founded, our remit was to launch a payments service very quickly. As one of the founding partners, Paytm, the biggest mobile wallet company in India, contributed the core concept, value proposition, and product architecture. The Indian engineering culture emphasizes delivering value and product quickly to users, while focusing energy and effort on the most important tasks. This mindset helped us launch a product in just three months, and it was a resounding success. This results-driven mindset continues to help us deliver new, valuable features and functionality to users. The Indian engineering culture also holds that it’s okay to make small mistakes on the way to success, as long as users are getting net positive value and we are moving quickly.
Does this seem to conflict with the Japanese engineering culture? It’s not a problem: the beauty of diversity and the conscious trade-off between different priorities are what really drive success.
A number of architects and engineers (including myself) came from another of our partner companies, Paytm Canada, and played a key role in the architectural design and initial implementation of the system. PayPay still benefits greatly from the defining traits of the Canadian/North American engineering culture: being technology-driven, doing things in a logical and scientific way, and automating as much as possible.
These engineers brought many modern software development and infrastructure processes, tools, and methodologies to PayPay. For example, our platform has been entirely Kubernetes-based, with an almost fully automated CI/CD pipeline, from day one. Looking back, we are very grateful for this design decision, which has enabled us to manage almost 200 microservices with a small team. This team also introduced best-in-class tools like Datadog, which allows us to monitor our environment, detect and proactively prevent errors, and quickly remediate issues: a critical capability given our extreme reliability requirements.
As we expand our engineering teams, we use hiring standards that originated in Canada and focus on fundamental skills rather than specific technical details: algorithms, data structures, coding and problem-solving skills, and culture fit. These standards have proven effective in finding the most capable and flexible engineers for PayPay’s future growth.
Beyond our own and our partners’ teams, we also learn from payment services around the world, studying campaign systems, payment transaction processing methods, and more from mega-popular services like Alipay in China.
New deployment process enables extreme reliability and rapid releases at scale
Here’s one example of how the engineering cultures I’ve described work together to devise creative solutions: we recently introduced an innovative new deployment process that helps us simultaneously meet our reliability, speed, and scale objectives.
Canary deployments are a popular practice in which a change is released to a small subset of users and monitored intensively before being deployed to all users. However, traditional canary deployments don’t work in Japan, for the quality and user-expectation reasons described above as well as for regulatory ones: even a 1% failure or error rate among real users is not acceptable.
To address this deployment challenge, we created a completely separate production environment called “Canary Production.” This environment has its own stateless components (e.g., a separate Kubernetes cluster) but shares the same stateful components (e.g., databases, third-party APIs) with the production environment, which means that real money can be transacted in it. The key point is that only PayPay employee accounts have access to Canary Production: any time an employee makes a transaction using PayPay, the load balancer routes them to Canary Production.
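The routing rule itself is simple: the deciding attribute is who the account belongs to, not a random traffic percentage. The sketch below illustrates that decision in Python. It is an assumption-laden illustration, not PayPay’s load balancer configuration: the upstream names, the `CanaryRouter` class, and the idea of checking against a set of employee account IDs are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical upstream names; the post does not say how clusters are addressed.
PRODUCTION = "production-cluster"
CANARY_PRODUCTION = "canary-production-cluster"


@dataclass(frozen=True)
class Request:
    account_id: str


class CanaryRouter:
    """Sends employee traffic to Canary Production and everyone else to Production.

    Both clusters are assumed to share the same databases and third-party APIs,
    so only the stateless tier differs between the two upstreams.
    """

    def __init__(self, employee_account_ids: set) -> None:
        # Hypothetical: employee membership is looked up from a fixed set of IDs.
        self._employees = frozenset(employee_account_ids)

    def upstream_for(self, request: Request) -> str:
        # Routing is by account identity, not by a percentage of traffic,
        # so no regular customer ever reaches an unpromoted build.
        if request.account_id in self._employees:
            return CANARY_PRODUCTION
        return PRODUCTION
```

For example, `CanaryRouter({"emp-001"}).upstream_for(Request("emp-001"))` returns `"canary-production-cluster"`, while any non-employee account gets `"production-cluster"`. Keying on identity rather than on a traffic fraction is what distinguishes this setup from a traditional canary.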
Every new deployment is pushed to Canary Production before it is pushed to Production. While a deployment is in Canary Production, we run hundreds of automated tests and closely monitor the environment with Datadog, including automatic algorithmic alerts such as anomaly detection. We also gain feedback from PayPay employees who use the environment to make real-world transactions. If anything is wrong, automatic mechanisms roll the deployment back. To avoid dependency problems, another automatic mechanism ensures that only one of our almost 200 microservices can be deployed at any one time. The whole process is fully automated, with GitHub Actions gluing the multiple moving parts together.
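The control flow described above (one deployment at a time, Canary Production first, promotion only if tests and monitoring both pass, otherwise roll back) can be sketched as follows. This is a Python illustration under stated assumptions, not the actual GitHub Actions workflow: `release_service`, `DeploymentLock`, `Outcome`, and the `deploy`/`run_tests`/`monitors_healthy` callbacks are hypothetical stand-ins for the real test runner, Datadog monitors, and Kubernetes deployment steps. The in-process `threading.Lock` only models the rule; a real pipeline would need a lock shared across runs.

```python
import threading
from enum import Enum
from typing import Callable, Optional

# Hypothetical callback signatures; the real interfaces are not described in the post.
Deploy = Callable[[str, str, str], None]  # (environment, service, version)
Check = Callable[[str], bool]             # (service) -> passed?


class Outcome(Enum):
    PROMOTED = "promoted"
    ROLLED_BACK = "rolled back"
    BLOCKED = "blocked"


class DeploymentLock:
    """Permits at most one microservice deployment in flight at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.holder: Optional[str] = None

    def try_acquire(self, service: str) -> bool:
        if not self._lock.acquire(blocking=False):
            return False  # some other service is mid-deployment
        self.holder = service
        return True

    def release(self) -> None:
        self.holder = None
        self._lock.release()


def release_service(
    service: str,
    new_version: str,
    current_version: str,
    lock: DeploymentLock,
    deploy: Deploy,
    run_tests: Check,
    monitors_healthy: Check,
) -> Outcome:
    """Deploy to Canary Production, gate on tests and monitors, then promote or roll back."""
    if not lock.try_acquire(service):
        return Outcome.BLOCKED
    try:
        deploy("canary-production", service, new_version)
        # Gate 1: automated test suite. Gate 2: monitoring / anomaly alerts.
        if run_tests(service) and monitors_healthy(service):
            deploy("production", service, new_version)
            return Outcome.PROMOTED
        # Any failed gate restores the last known-good version in Canary Production.
        deploy("canary-production", service, current_version)
        return Outcome.ROLLED_BACK
    finally:
        lock.release()
```

Because Production is only ever touched after both gates pass, a bad build can at worst affect employee transactions in Canary Production, and the lock keeps two interdependent services from changing underneath each other mid-rollout.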
The Canary Production solution brings together the best of our engineering cultures: it enables us to meet stringent quality standards while deploying new features frequently to deliver value quickly to customers. And it is possible because we have a highly flexible, automated, and scalable infrastructure with modern tooling and processes.
We pride ourselves on our multicultural engineering team at PayPay, and we see that the diversity of perspectives leads to innovative solutions and fantastic business results. We’re always seeking engineers who can help us push the limits of speed, scale, and reliability.
Read more about our team on the PayPay Inside-out.
About the author
Shilei Long is Senior Manager and Architect of Product Engineering at PayPay. Shilei looks after the overall design, implementation, and operation of the complex systems that support millions of daily transactions. He has over 10 years of software development experience, including leadership roles at Paytm Canada and engineering roles at Amazon and Oracle.
For more information on PayPay and the tools we use, check out the following links: