In this article, we’ll explore the motivations behind our decision to undertake the migration of PayPay point flows out of the wallet domain and outline how we accomplished it.

Background

PayPay Points covers a set of balances and flows in PayPay that are mostly, but not exclusively, associated with PayPay’s cashback system. In 2023, we launched a feature that enables split payments in PayPay. During that development, we recognized an opportunity to further segregate the wallet domain, in line with the overall direction of that project. The decision was driven by the realization that the point part of the wallet domain was evolving into a standalone payment method, distinct from the wallet payment methods. The separation would also reduce complexity within the wallet domain and improve the development experience and agility of our engineering team, particularly for campaign and point flows.

Existing Architecture

Before describing the migration, this section explains the architecture we had beforehand, along with some background on the wallet domain.

Wallet & Accounts

Each user has exactly one wallet (a 1-to-1 mapping), and a wallet can hold multiple accounts. The point account is one of them, although we used to call it CASHBACK. That name is no longer accurate given the current scope of points, because cashback is no longer the only way to obtain points.

Money Movements

Transactions in the wallet domain are based on double-entry accounting, meaning every debit entry has a corresponding credit entry. Internally, we call these entries money movements for short. A heavily simplified transaction entry looks like the table below.

id	src_user	dst_user	src_account	dst_account	amount
1	123	456	Account A	CASHBACK	100
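
To make the double-entry idea concrete, here is a minimal sketch of applying one money movement to in-memory balances. The field names mirror the simplified table above; the `MoneyMovement` class, the `apply_movement` helper, and the starting balance are hypothetical and only illustrate that one debit is always paired with one equal credit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoneyMovement:
    # Columns of the simplified table above.
    id: int
    src_user: int
    dst_user: int
    src_account: str
    dst_account: str
    amount: int


def apply_movement(balances, movement):
    """Return new balances with the source debited and the destination credited."""
    if movement.amount <= 0:
        raise ValueError("amount must be positive")
    src = (movement.src_user, movement.src_account)
    dst = (movement.dst_user, movement.dst_account)
    if balances.get(src, 0) < movement.amount:
        raise ValueError("insufficient funds")
    updated = dict(balances)
    updated[src] = updated.get(src, 0) - movement.amount  # debit entry
    updated[dst] = updated.get(dst, 0) + movement.amount  # matching credit entry
    return updated


# Hypothetical starting balance for user 123.
balances = {(123, "Account A"): 500}
movement = MoneyMovement(1, 123, 456, "Account A", "CASHBACK", 100)
after = apply_movement(balances, movement)
```

Because every movement debits and credits the same amount, the sum of all balances is unchanged by any single entry.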

2PC API Design

We adopted a two-phase commit (2PC for short) API design for initiating transactions in the wallet domain. This means that a transaction has to go through two phases to complete, and can be rolled back if interrupted. For example, a CASHBACK payment transaction requires two API calls:

/payment/prepare
/payment/commit

If the transaction needs to be canceled, the /payment/rollback API is called. The final accounting entry, crediting the merchant’s account, is only created after a successful commit.
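
The prepare, commit, and rollback phases can be pictured as a small state machine. The sketch below is an in-memory stand-in, not the wallet implementation: the `TwoPhasePayment` class, the state names, and the transaction IDs are all assumptions made for illustration.

```python
from enum import Enum


class TxState(Enum):
    PREPARED = "PREPARED"
    COMMITTED = "COMMITTED"
    ROLLED_BACK = "ROLLED_BACK"


class TwoPhasePayment:
    """Hypothetical stand-in for /payment/prepare, /payment/commit and /payment/rollback."""

    def __init__(self):
        self.states = {}   # tx_id -> TxState
        self.amounts = {}  # tx_id -> amount recorded at prepare time
        self.entries = []  # final accounting entries, written on commit only

    def prepare(self, tx_id, amount):
        if tx_id in self.states:
            raise ValueError(f"transaction {tx_id} already exists")
        self.states[tx_id] = TxState.PREPARED
        self.amounts[tx_id] = amount

    def _require_prepared(self, tx_id):
        if self.states.get(tx_id) is not TxState.PREPARED:
            raise ValueError(f"transaction {tx_id} is not in PREPARED state")

    def commit(self, tx_id):
        self._require_prepared(tx_id)
        self.states[tx_id] = TxState.COMMITTED
        self.entries.append((tx_id, self.amounts[tx_id]))

    def rollback(self, tx_id):
        self._require_prepared(tx_id)
        self.states[tx_id] = TxState.ROLLED_BACK


api = TwoPhasePayment()
api.prepare("tx-1", 100)
api.commit("tx-1")      # the accounting entry is created here
api.prepare("tx-2", 50)
api.rollback("tx-2")    # no accounting entry is ever created
```

A transaction that stays in the prepared state is what later sections call a long-lived or non-terminal transaction.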

Motivations

It became clear that the wallet domain had become overloaded with responsibilities, making it challenging for the team to focus on essential technical enablers and improvements—particularly given the wallet domain’s role as a foundational component of PayPay.

Both wallet and point related features were getting increasingly difficult to work on because of the massive overhead from project and code conflicts. The wallet domain had essentially become a monolith containing two completely different domains. The idea arose that they could be handled separately if we carried out a proper migration.
Tech excellence projects and system improvements were getting increasingly difficult to work on because the wallet and point flows were coupled. The two had gradually become different entities with different behavior and goals, yet they still had to share the same building blocks.
Infrastructure could not be scaled individually even though the needs of the wallet and point flows differed. The same databases and application pods served completely different use cases.

Objectives

Given the challenges within the wallet domain, we began to define our goals. Our first step was to identify the key issues we needed to address and determine the best approach to resolve them. We focused on migrating the point-related flows out of the wallet domain.

Basic Requirements

We believed that moving point-related flows to their own domain could address many of our pain points. While this concept seems straightforward, it becomes quite complex when considering our existing architecture and the scale at which PayPay operates. Essentially, we needed to extract all point-related flows from the wallet domain, including CASHBACK accounts and all associated components such as payments, refunds, scheduled point activations, and commissions. This separation would enable us to scale point flows independently from the wallet domain and facilitate the easier introduction of new point-related features.

Constraints

Given the existing architecture and the critical nature of PayPay as a relied-upon service, we needed to approach this migration with caution. We prioritized careful design and planning, keeping these considerations in mind.

Timeline

PayPay is a fast-paced environment, and there is always something in the pipeline to provide the best experience for our customers. New projects depended on the completion of this migration, and we wanted to avoid delaying them, which left us only about eight months to finish development and complete the migration.

Phased Migration

This migration was both significant and risky. We committed to doing everything possible to avoid breaking bugs or errors in the migration process. To limit the impact of any unforeseen issues on our users, we adopted a controlled rollout strategy. Rather than migrating all users to the new system at once, we could beta test and monitor the stability of the new system while rolling out the migration incrementally.

Downtime

To ensure data consistency during the complete flow and data migration, some form of cutoff, whether short or long, is essential for switching the APIs and data related to point flows. We had one scheduled maintenance window available, but its timing and duration were inflexible. Outside of these rare windows (which occur only once or twice a year), we cannot afford significant disruptions to core PayPay services. PayPay is already widely recognized as critical infrastructure in Japan that both consumers and merchants rely on in their daily lives.

Since some point-related flows are integral to core functions such as payments, we had to consider the downtime strategy very carefully. We determined that we could tolerate limited downtime as long as it was isolated to point-related features and did not affect other core services.

Design

Taking into account the requirements and constraints discussed earlier, we began designing and planning the migration. In this section, we will provide a high-level overview of our design choices and the reasoning behind them.

Data Migration Cutoff

Problem: Data SSOT Switch

As a financial service provider, we cannot afford mistakes in account balances; there must be a single source of truth at all times, ensuring that CASHBACK and POINT accounts remain fully synchronized. We also need to maintain backward compatibility with the old wallet system APIs while introducing new APIs for the updated flows. This, combined with our goal of a phased migration rollout, creates a complex problem to solve.

While achieving near-real-time CASHBACK to POINT replication is straightforward, our constraints led to several non-trivial challenges:

Unified Source of Truth: Both the wallet and point systems must operate in sync, recognizing that only one source of truth can exist at any given time.
System Interaction During Migration: To enable backward compatibility and a phased migration, the two systems must keep interacting with each other until the rollout is complete. The wallet system will need to consult the new point system for migrated users, while the point system must check with the wallet system for users who have not yet migrated.
Pre-Switch Verification: Before initiating the switch for a user, we must ensure that two conditions are met:

There are no active database transactions affecting the accounts targeted for the switch. This is challenging to guarantee for wallet account data due to the high frequency of point transactions, and we cannot accept new transactions until the switch is complete (simply queuing operations was not feasible because many flows are synchronous and tweaking them would require a massive migration effort in itself).
The data in POINT within the new system must be identical to the old CASHBACK data.
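
As a rough illustration of these constraints, the sketch below shows per-user routing to a single source of truth and the two pre-switch conditions as a simple predicate. The function names, the `migrated_users` set, and the integer balances are hypothetical; the real checks run against live databases rather than in-memory values.

```python
def source_of_truth(user_id, migrated_users):
    """Exactly one system owns a user's point balance at any given time."""
    return "point" if user_id in migrated_users else "wallet"


def can_switch(in_flight_tx_ids, cashback_balance, point_balance):
    """Both pre-switch conditions must hold before a user is switched."""
    no_active_transactions = len(in_flight_tx_ids) == 0
    balances_match = cashback_balance == point_balance
    return no_active_transactions and balances_match


migrated_users = {42}
owner_for_migrated = source_of_truth(42, migrated_users)   # handled by the point system
owner_for_legacy = source_of_truth(7, migrated_users)      # still handled by the wallet
```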

Migrating data that is continuously being updated poses significant challenges and necessitates a strategic cutoff during the transition. While we identified methods to minimize noticeable disruptions, each option comes with its own trade-offs.

Idea: The No Downtime Approaches

Although we ultimately chose not to pursue a no-downtime approach, we thought it worthwhile to examine our thought process. One design we found particularly intriguing involved a migrator job. The idea was to retain the old accounts in their original databases while simultaneously creating copies of existing CASHBACK accounts with a zero balance in the new database. Once this was accomplished, we would prompt the clients to transition to the new point service APIs. Depending on a user’s migration status, either the wallet system would reference the point system as the single source of truth (SSOT) or vice versa, with migration statuses centralized for easy access by both systems. Below is a rough sketch of the transaction flow.

During this period, a background job would handle the account migration. If executed correctly, this approach could facilitate a seamless transition with no noticeable disruptions to our service. The disruption time for each user would be limited to the duration of the migrator job’s operations, as illustrated in the diagram below.

However, after weighing this option against our requirements, we ultimately decided against it for several reasons:

High Development Effort: The complexity of this approach demanded significantly more development resources compared to our other options. The minimal disruption benefit did not justify the substantial development costs involved.
Increased QA Costs: The high development effort also translated to greater quality assurance costs, making it less viable, especially since this was intended for a one-time migration.
Heightened Risk: The complexity of the implementation raised the risk level considerably. Even minor errors in the implementation could lead to severe data inconsistencies, creating numerous potential points of failure.

We weighed other criteria as well, which we cover when discussing our final approach, but these were the main reasons that pushed us to look in other directions. In summary, while the no-downtime approach had its merits, the associated costs and risks led us to pursue a different strategy.

Solution: The Final Approach

When prioritizing consistency over availability during data migration, the most straightforward approach might seem to involve a comprehensive maintenance window that takes the entire PayPay system offline. However, as we noted in the constraints section, this option is not viable for PayPay. Therefore, we explored several alternatives, carefully evaluating them against a few critical criteria:

Development Effort
Development Scope
Estimated Time of Arrival (ETA) for a Full Migration Rollout
Rollback Plan Complexity
Disruption Severity
Risk Assessment

After thorough consideration of our options and their trade-offs based on these criteria, we opted for what we refer to as a partial maintenance approach. This innovative strategy allows us to maintain critical PayPay functionalities while temporarily restricting customer access to their point balances, specifically during off-peak hours. Given our existing UI components and infrastructure, this approach proved to be relatively low in scope and straightforward to implement. While it may not be a perfect solution, it aligns well with our criteria and offers a balanced compromise.

We then set about designing and planning the specifics of this partial maintenance strategy. This solution dovetails with our phased migration rollout, in which we conducted multiple partial maintenance sessions and migrated selected user groups in each phase. Each maintenance session affected only the users designated for that rollout. Here’s a high-level outline of our maintenance steps:

Display a maintenance banner to targeted users approximately 20 hours before maintenance begins.
Activate partial maintenance mode for affected users, during which they will be unable to utilize their points.
Ensure that all in-flight CASHBACK operations are either completed or terminated.
Initiate data verification checks.
Update migration configurations for targeted users, switching from CASHBACK to POINT in the new database.
Deactivate partial maintenance mode.
Remove the maintenance banner.

The entire process was designed to take no more than two hours. We bolstered our confidence in this approach by conducting pre-rollout data consistency verifications. This solution effectively mitigates the risk of data inconsistency during migration while minimizing disruption for our customers.
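
The core of one maintenance session (activating partial maintenance, checking in-flight operations, verifying data, switching, and deactivating maintenance) could be sketched as follows. Everything here is an assumption for illustration: the function names, the in-memory sets and dictionaries, and in particular the choice to skip a user rather than switch them when a check fails.

```python
from contextlib import contextmanager


@contextmanager
def partial_maintenance(users, maintenance_set):
    """Activate maintenance mode for the targeted users and always lift it afterwards."""
    maintenance_set.update(users)
    try:
        yield
    finally:
        maintenance_set.difference_update(users)


def migrate_group(users, cashback, point, in_flight, migrated, maintenance_set):
    """Switch each targeted user to POINT if their checks pass; return the switched users."""
    switched = []
    with partial_maintenance(users, maintenance_set):
        for user_id in users:
            # Assumption: a user with unfinished CASHBACK operations is skipped.
            if user_id in in_flight:
                continue
            # Data verification: the POINT copy must equal the CASHBACK balance.
            if cashback.get(user_id, 0) != point.get(user_id, 0):
                continue
            # Update the migration configuration for this user.
            migrated.add(user_id)
            switched.append(user_id)
    return switched


users = [1, 2, 3]
cashback = {1: 100, 2: 50, 3: 10}
point = {1: 100, 2: 40, 3: 10}   # user 2 has a mismatch
in_flight = {3}                  # user 3 still has an in-flight operation
migrated, maintenance = set(), set()
switched = migrate_group(users, cashback, point, in_flight, migrated, maintenance)
```

In this sketch only user 1 is switched, and maintenance mode is lifted for the whole group even if a check raises an error.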

Data Migration

Problem: Data Synchronization and Consistency Challenges

During the migration of CASHBACK accounts from the wallet to the point service, maintaining data consistency is crucial. While users continue to interact with the old wallet system and the point service processes transactions, simultaneous updates in both databases create a risk of data mismatches.

Solution: Dual Write and Reconciliation

We implemented a dual write strategy in which the wallet service updates the old wallet CASHBACK data and also publishes the change to a Kafka topic consumed by the point service. This ensures that changes are reflected in both systems. Regular reconciliation processes check for discrepancies, ensuring accurate transaction capture.
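
A minimal sketch of dual write plus reconciliation is shown below. A plain Python list stands in for the Kafka topic, and the class names, event shape, and `reconcile` helper are hypothetical; the sketch only illustrates how a lagging consumer shows up as a discrepancy that reconciliation can detect.

```python
class WalletService:
    """Writes CASHBACK locally and appends the same change to a topic (a list here)."""

    def __init__(self, topic):
        self.cashback = {}
        self.topic = topic  # stand-in for the Kafka topic

    def credit_cashback(self, user_id, amount):
        self.cashback[user_id] = self.cashback.get(user_id, 0) + amount
        self.topic.append({"user_id": user_id, "delta": amount})


class PointReplica:
    """Consumes change events and applies them to the POINT balances."""

    def __init__(self):
        self.balances = {}
        self.offset = 0  # index of the next unread event

    def consume(self, topic):
        for event in topic[self.offset:]:
            user_id = event["user_id"]
            self.balances[user_id] = self.balances.get(user_id, 0) + event["delta"]
        self.offset = len(topic)


def reconcile(wallet_cashback, point_balances):
    """Return {user_id: (wallet_value, point_value)} for every mismatch."""
    mismatches = {}
    for user_id in wallet_cashback.keys() | point_balances.keys():
        wallet_value = wallet_cashback.get(user_id)
        point_value = point_balances.get(user_id)
        if wallet_value != point_value:
            mismatches[user_id] = (wallet_value, point_value)
    return mismatches


topic = []
wallet = WalletService(topic)
replica = PointReplica()
wallet.credit_cashback(1, 100)
wallet.credit_cashback(2, 30)
replica.consume(topic)
in_sync = reconcile(wallet.cashback, replica.balances)
wallet.credit_cashback(1, 5)  # not consumed yet
lagging = reconcile(wallet.cashback, replica.balances)
```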

Maintenance Strategy

During partial maintenance, targeted users are placed in maintenance mode so that their data can be verified and switched without concurrent updates.

Unprocessed Transactions: Transactions that remain unprocessed during maintenance will be retried afterward, ensuring accurate processing without duplication.

Migrating APIs

Problem: Flow Switch

Points were previously a part of the wallet system, existing alongside other wallet accounts and accessed together via a single API call. With the migration, we needed to guide our clients through a migration to the new flow that will treat points as a completely separate entity from the wallet domain, all while maintaining backward compatibility.

The backward compatibility challenge was complex, particularly due to our implementation of two-phase commit (2PC). We had to continue supporting the old APIs and workflows even after our data had migrated and clients transitioned to the new flow, as there are long-lived transactions to consider. As previously mentioned, even a simple debit or credit transaction involves prepare and commit phases. Additionally, other transaction types, such as preauthorization, can result in long-lived intermediary states. To address this, we needed to create an abstraction that allows clients to complete these non-terminal transactions using the old APIs, despite the fact that the data no longer resides within the wallet system.

We needed to maintain two sets of APIs:

Old APIs: These are necessary for backward compatibility, ensuring the continuation of older transactions (Transition Flow).
New APIs: These support the new flows required for the phased rollout (Phased Rollout Flow).

Additionally, since we must accommodate a phased migration rollout mechanism—where the data source may reside in either the wallet system or the new point system depending on the current migration configuration—we developed an elaborate design to support both API sets effectively.

Solution Part 1: Transition Flows

This is the design to support continuation of legacy transaction schemes for a migrated user using the old APIs. It basically boils down to debit, credit, and account priority.

This image illustrates the account priority within the wallet system. Typically, the entire user wallet, encompassing all accounts, was accessed through a single API. Debit priority follows a top-to-bottom order, while credit priority is reversed. CASHBACK usually holds the highest priority for debit transactions and the lowest for credit transactions.

Credit

The credit scenario is simpler than the debit scenario. If a credit request includes the exact breakdown, we can credit all accounts, including the migrated POINT, as a distributed transaction. Typically, there are no dependencies between the credit transactions, making the process straightforward. In case of failures, we can easily manage them through automated retries.

Debit

Handling debits is more complex compared to credits. As noted earlier, the POINT (CASHBACK) account typically has the highest priority, creating dependencies between distributed transactions. The sequence for debiting must be as follows:

Debit from the new point system
   1. Debit POINT
Debit from the wallet system
   1. Debit Account C
   2. Debit Account B
   3. Debit Account A

To maintain performance, we aimed to avoid pessimistic locking on accounts. However, without this locking, it’s challenging to ensure that the requested amount for a point debit will be available at the time of the transaction, even if the balance was verified beforehand. The same applies to Accounts C, B, and A. It’s possible for a transaction to fail in the wallet system due to insufficient funds in these accounts, even if the POINT debit was successful. Typically, such scenarios are managed with an orchestrator and automated rollback, but we wanted a simpler solution.

Our approach is as follows:

Calculate the current total balance in the wallet system as TOTAL_WALLET_BALANCE.
Determine the amount needed from POINT as a range:
   1. MAX_POINT_DEBIT = REQUESTED_DEBIT_AMOUNT
   2. MIN_POINT_DEBIT = REQUESTED_DEBIT_AMOUNT - TOTAL_WALLET_BALANCE
Maintain a state for this debit request to ensure idempotent behavior, reusing the calculated amounts even if the same request is retried.
Request the debit from the point system, allowing it to perform a best-effort debit within the range of MIN_POINT_DEBIT to MAX_POINT_DEBIT. If a debit within this range is not feasible, the request must fail. Otherwise, the point system returns the amount it actually debited, DEBITED_POINT_AMOUNT.
Upon receiving DEBITED_POINT_AMOUNT, recalculate TOTAL_WALLET_BALANCE (as it may have changed since the last calculation). If DEBITED_POINT_AMOUNT + TOTAL_WALLET_BALANCE is greater than or equal to REQUESTED_DEBIT_AMOUNT, proceed with the operation. Otherwise, automatically reverse the previously committed point debit and return an insufficient funds failure response.

With that approach, we are able to keep a safe and idempotent behavior while also not having to create a new orchestration system.
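
The range-based debit can be sketched as below. This is a single-threaded illustration under stated assumptions: the `Wallet` and `PointSystem` classes, the in-memory `state` dictionary, and the result strings are hypothetical, and the lower bound is clamped at zero here (an assumption) so that a large wallet balance does not produce a negative minimum.

```python
class Wallet:
    def __init__(self, accounts):
        self.accounts = dict(accounts)  # insertion order = debit priority (C, B, A)

    def total(self):
        return sum(self.accounts.values())

    def debit(self, amount):
        for name in self.accounts:
            take = min(self.accounts[name], amount)
            self.accounts[name] -= take
            amount -= take


class PointSystem:
    def __init__(self, balance):
        self.balance = balance

    def debit_within(self, min_amount, max_amount):
        """Best effort: debit as much as possible up to max_amount, or fail below min_amount."""
        amount = min(self.balance, max_amount)
        if amount < min_amount:
            return None
        self.balance -= amount
        return amount

    def reverse(self, amount):
        self.balance += amount


def debit_range(requested, total_wallet_balance):
    max_point_debit = requested
    min_point_debit = max(0, requested - total_wallet_balance)  # clamp is an assumption
    return min_point_debit, max_point_debit


def transition_debit(request_id, requested, wallet, points, state):
    record = state.setdefault(request_id, {})
    if "result" in record:           # retried request: reuse the stored outcome
        return record["result"]
    if "range" not in record:        # reuse the calculated amounts on retry
        record["range"] = debit_range(requested, wallet.total())
    if "debited" not in record:
        debited = points.debit_within(*record["range"])
        if debited is None:
            record["result"] = "INSUFFICIENT_FUNDS"
            return record["result"]
        record["debited"] = debited
    # Recalculate the wallet total, since it may have changed in the meantime.
    if record["debited"] + wallet.total() >= requested:
        wallet.debit(requested - record["debited"])
        record["result"] = "OK"
    else:
        points.reverse(record["debited"])
        record["result"] = "INSUFFICIENT_FUNDS"
    return record["result"]


wallet = Wallet({"C": 30, "B": 20, "A": 0})
points = PointSystem(80)
state = {}
first = transition_debit("req-1", 100, wallet, points, state)
retry = transition_debit("req-1", 100, wallet, points, state)
```

In this example the range is 50 to 100, the point system debits 80, and the remaining 20 comes from Account C. The retry returns the stored result without debiting again.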

Solution Part 2: Phased Rollout Flows

In a phased rollout, migrating APIs to a new point service requires handling transactions carefully to avoid errors, particularly when some users are migrated and others are not. The challenge is ensuring seamless transaction handling when users switch between systems, and avoiding issues like duplicate transactions.

Problem: Segregated Transactions and Risk of Double Processing

During the rollout, users who have not yet migrated to the new point service will still interact with the old wallet system. The point service acts as a proxy for these users, redirecting their transactions to the wallet. Once migrated, the point service processes their transactions directly. However, this leads to a situation where a transaction might start under the wallet system and finish under the point service, causing potential double processing if the point service doesn’t have the full transaction context.

Solution: Transaction Attempt Tracking

To prevent double processing of transactions during the phased rollout, the point service must track any proxy attempts made to the wallet service. Before processing an operation, point-main will consult wallet-main to check for any previous attempts. If a user migrates mid-transaction, this log helps avoid duplications.
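
The idea can be illustrated with the sketch below, where the point service records every transaction it proxies and checks with the wallet before handling a transaction directly. The class names, the `has_attempt` lookup, and the returned labels are hypothetical, and what the real services do once a previous attempt is found is not described here; the sketch simply refuses to process the transaction a second time.

```python
class WalletStub:
    """Stand-in for the wallet service and its record of processed transactions."""

    def __init__(self):
        self.processed = {}

    def process(self, tx_id, amount):
        self.processed.setdefault(tx_id, amount)
        return "WALLET"

    def has_attempt(self, tx_id):
        return tx_id in self.processed


class PointService:
    def __init__(self, wallet, migrated_users):
        self.wallet = wallet
        self.migrated_users = migrated_users
        self.proxied = set()   # attempts forwarded to the wallet
        self.processed = {}    # transactions handled directly by the point service

    def handle(self, user_id, tx_id, amount):
        if user_id not in self.migrated_users:
            self.proxied.add(tx_id)  # track the proxy attempt
            return self.wallet.process(tx_id, amount)
        # Migrated user: check for a previous attempt before processing directly.
        if tx_id in self.proxied or self.wallet.has_attempt(tx_id):
            return "ALREADY_HANDLED_BY_WALLET"
        self.processed.setdefault(tx_id, amount)
        return "POINT"


migrated_users = set()
service = PointService(WalletStub(), migrated_users)
before = service.handle(1, "t1", 10)   # proxied to the wallet
migrated_users.add(1)                  # the user migrates mid-transaction
after = service.handle(1, "t1", 10)    # same transaction arrives again
fresh = service.handle(1, "t2", 5)     # a new transaction after migration
```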

Maintenance Strategy

During maintenance, targeted users are placed into maintenance mode, and any unprocessed transactions are handled as follows:

Unprocessed Transactions: Any transactions that remain unprocessed during the maintenance window will be retried post-maintenance. This ensures that they receive the correct status and are processed accurately without duplication.

This approach minimizes the risk of transaction errors and ensures consistent processing across both wallet and point services.

ID Exclusivity

Problem

The risk of Transaction ID conflict emerges during the transition from wallet to point systems. When a request is made to retrieve a balance using a Transaction ID, the point system forwards any unrecognized transactions to the wallet system for processing. This step is crucial for maintaining backward compatibility, as there was previously no clear distinction between the old and new transaction flows on the client side. Without this separation, unrecognized transactions could result in inconsistencies and errors in balance retrieval.

If the point system generates a Transaction ID that matches one from the wallet for a different transaction, it risks returning incorrect results by referencing its own transaction. This issue predominantly impacts users who have not yet fully migrated, as older transactions remain intact, with the point system’s Transaction ID typically being ahead. In contrast, new transactions exist solely within the point system. Consequently, the potential for conflict is especially significant during these transitional phases of the balance-fetching process.

Addressing this challenge is vital for ensuring accurate and reliable transaction handling, which ultimately enhances user experience and maintains system integrity throughout the migration phase.

Solution

We ensured that transaction IDs are unique across the wallet and point systems by using a custom implementation of the Snowflake ID generation algorithm.

The wallet system always sets the least significant bit to 0.
The new point system uses the same UID generator but with the least significant bit set to 1.
This approach prevents any transaction ID collisions during simultaneous ID generation across both systems.
The solution is scalable and maintains ID uniqueness for high-concurrency transactions.
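
A minimal sketch of a Snowflake-style generator with a reserved least significant bit follows. The bit layout (41-bit timestamp, 10-bit worker, 11-bit sequence, 1 system bit), the epoch value, and the `ParitySnowflake` class name are assumptions for illustration and are not taken from the actual generator.

```python
import threading
import time


class ParitySnowflake:
    """Snowflake-style IDs whose least significant bit identifies the issuing system."""

    _WORKER_BITS = 10
    _SEQ_BITS = 11
    _SEQ_MASK = (1 << _SEQ_BITS) - 1

    def __init__(self, system_bit, worker_id=0, epoch_ms=1_600_000_000_000):
        if system_bit not in (0, 1):
            raise ValueError("system_bit must be 0 or 1")
        if not 0 <= worker_id < (1 << self._WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._system_bit = system_bit
        self._worker_id = worker_id
        self._epoch_ms = epoch_ms
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            # Never move backwards, even if the wall clock does.
            now = max(int(time.time() * 1000), self._last_ms)
            if now == self._last_ms:
                self._seq = (self._seq + 1) & self._SEQ_MASK
                if self._seq == 0:
                    now += 1  # sequence exhausted: advance to the next logical millisecond
            else:
                self._seq = 0
            self._last_ms = now
            body = (
                ((now - self._epoch_ms) << (self._WORKER_BITS + self._SEQ_BITS))
                | (self._worker_id << self._SEQ_BITS)
                | self._seq
            )
            # Shift left by one and reserve the lowest bit for the system marker.
            return (body << 1) | self._system_bit


wallet_ids = ParitySnowflake(system_bit=0)  # wallet system: even IDs
point_ids = ParitySnowflake(system_bit=1)   # point system: odd IDs
```

Even if both generators produce the same timestamp, worker, and sequence values at the same moment, the resulting IDs differ in the last bit, so the two ID spaces never overlap.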

Execution

We decided to split the execution of the migration into two parts. This decision was based on the understanding that migrating the client APIs and flows presented greater risks, while the phased migration rollouts were relatively lower risk.

API & Flow Migration

We decided to begin with the API and flow migration without switching the actual data source for any users initially. This phase was quite risky, as the wallet system is a core component relied upon by many clients. While we conducted extensive testing, we wanted to perform final verifications before going live. The scheduled maintenance window provided an opportunity for these last internal checks and tests during the limited downtime, allowing us to roll back if any critical issues arose. After thoroughly testing everything with our prepared scenarios, we lifted the maintenance and the flow migration went live.

Phased Rollout

We initiated the phased migration rollout as planned, starting from 0% and gradually progressing to 100%. At some of the partial maintenance and rollout stages, we identified minor issues, which we fixed promptly before advancing to the next phase. The rollout proceeded according to schedule without major blockers, ensuring a mostly seamless and invisible experience for users. It was gratifying to address even the smallest issues quickly, and we encountered no critical problems. Had any significant issues arisen, our phased migration approach would have allowed us to address them early, minimizing the impact on users.

Final Outcome

After completing the migration, we achieved our initial goals and more. We now have a dedicated team managing all point-related flows, enabling the development of new features that are completely decoupled from the wallet system. One significant feature was already in progress during the migration, aligned with the final architecture for a full rollout. This shift has greatly increased our development velocity, as the team can focus solely on point flows without the overhead of conflicts from a single, large wallet domain.

Additionally, we can now scale the infrastructure for point flows independently, which should lead to more efficient resource allocation in the future.

What’s Next

In an ideal world, we would have decoupled everything in one go. However, given our limited time and the extensive scope of the migration, we prioritized the most critical components, which we have successfully completed. While some tasks remain, such as addressing supporting services that still share application logic and database clusters, these will be easier to tackle now that the heavy lifting is done. We look forward to enhancing the modularity of these two domains and addressing any challenges that arise.

Navigating Complexity: The Migration Journey of PayPay’s Point System
