As you can imagine, PayPay deals with a lot of information and data, which for privacy reasons we keep siloed and secure. However, for us to provide a superior user experience, we need to make it easy for users to search for their preferred products, stores or services on our app. We as an organization also need to be able to access data and analytics from across our user and merchant databases so that we can provide better customer support, develop more effective products and ensure our users have the information they need. Easy handling of data and providing the right information at the right time is a key backbone of our service.
Any delay in providing this information can lead to a negative experience for both our users and our merchants. This lag can be caused by searching data across many databases and tables, and by the time it takes to retrieve and sort the results. To solve this problem and stay efficient for our users, we use Elasticsearch.
Elasticsearch is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack.
Where to apply Elasticsearch
- Full-text search
- Big data (non-transactional queries)
- Faceted search (filtering)
- Geographical queries
Architecture and Data Structure
Elasticsearch stores all data as JSON documents, which can be accessed over HTTP using curl, a web browser, REST clients or any standard HTTP library.
Conceptually, Elasticsearch has bags that hold these JSON documents. Each bag is called an index, which is similar to a database in the relational model. Inside each index, it stores multiple doctypes, which are similar to tables in the relational model (doctypes, also called mapping types, apply to older releases; they were deprecated in Elasticsearch 6 and removed in later versions). It is always better to provide the schema of a doctype explicitly. Multiple documents can be stored inside one doctype, just like rows in a table.
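To make the index, schema and document idea concrete, here is a minimal sketch of the JSON bodies involved. The index name `stores` and every field name are hypothetical, chosen only for illustration, and the mapping uses the typeless layout of recent Elasticsearch releases. Nothing is sent to a cluster; the code only builds and prints the bodies.

```python
import json

# Hypothetical index name, for illustration only.
INDEX_NAME = "stores"

# Explicit schema (mapping): declares the type of every field up front.
# All field names here are assumptions, not PayPay's real schema.
mapping_body = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},          # analyzed for full-text search
            "category": {"type": "keyword"},   # exact-match value
            "location": {"type": "geo_point"}, # used by geographical queries
            "rating": {"type": "float"},
        }
    }
}

# One document, comparable to a row in a relational table.
document = {
    "name": "Sample Ramen Shop",
    "category": "restaurant",
    "location": {"lat": 35.68, "lon": 139.76},
    "rating": 4.2,
}


def undeclared_fields(doc, mapping):
    """Return the document fields that the explicit mapping does not declare."""
    declared = mapping["mappings"]["properties"]
    return sorted(set(doc) - set(declared))


# These bodies would typically be sent over HTTP, for example as
# PUT /stores (mapping) and POST /stores/_doc (document).
print(json.dumps(mapping_body, indent=2))
print(json.dumps(document, indent=2))
print(undeclared_fields(document, mapping_body))
```

Checking a document against the declared fields before indexing is one simple way to notice when data drifts away from the explicit schema.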
How Elasticsearch actually searches
ES uses an inverted index to answer queries and searches. An inverted index is basically a table that maps each word to the IDs (doc_id) of the documents that contain it.
In the above figure, the word “row” appears in docs 0, 1 and 2, “boat” in docs 1 and 2, and “chicken” in doc 2. With this mapping, UNION and INTERSECTION can easily be applied to find the relevant docs for a combination of words. However, the mapping above won’t help us do a full-text search for a phrase like “row boat”, as it does not record the position of each word in the doc.
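The inverted index described above can be sketched in a few lines of Python. This is a toy model, not how Elasticsearch implements it internally, and the three sample documents are made up so that the words land in the same docs as in the description (row in 0, 1 and 2; boat in 1 and 2; chicken in 2).

```python
from collections import defaultdict

# Made-up sample documents; the list position is the doc_id.
docs = [
    "row row row",       # doc 0
    "row your boat",     # doc 1
    "chicken row boat",  # doc 2
]


def build_inverted_index(documents):
    """Map each word to the set of doc_ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all(index, words):
    """INTERSECTION: docs that contain every one of the words."""
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()


def search_any(index, words):
    """UNION: docs that contain at least one of the words."""
    postings = [index.get(word, set()) for word in words]
    return set.union(*postings) if postings else set()


index = build_inverted_index(docs)
print(sorted(index["row"]))                         # [0, 1, 2]
print(sorted(search_all(index, ["row", "boat"])))   # [1, 2]
print(sorted(search_any(index, ["boat", "chicken"])))  # [1, 2]
```

Note that `search_all` returns doc 1 for the words "row" and "boat" even though they are not adjacent there, which is exactly the limitation the text points out.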
How full-text search is implemented in Elasticsearch
In order to do full-text search, ES stores the positions of each word along with the doc_id in the inverted index mapping table.
With the word positions available, relevant documents can be found through simple manipulations, as illustrated in the figure below.
To find “row boat”, the docs are first filtered down to those that contain both words. Then each remaining doc is checked, one by one, for an occurrence of “boat” at the position immediately after “row”.
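The two steps above (filter to docs containing every word, then check adjacent positions) can be sketched as follows. Again this is a simplified model under assumed sample data, not Elasticsearch's actual implementation; the documents and function names are hypothetical.

```python
from collections import defaultdict

# Made-up sample documents; the list position is the doc_id.
docs = [
    "row row row",       # doc 0
    "row your boat",     # doc 1
    "chicken row boat",  # doc 2
]


def build_positional_index(documents):
    """Map each word to {doc_id: [positions of the word in that doc]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(documents):
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return the doc_ids where the words of the phrase appear consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []

    # Step 1: keep only the docs that contain every word of the phrase.
    candidates = set(index[words[0]])
    for word in words[1:]:
        candidates &= set(index[word])

    # Step 2: in each candidate, look for a start position such that
    # the n-th word of the phrase sits at start + n.
    matches = []
    for doc_id in sorted(candidates):
        for start in index[words[0]][doc_id]:
            if all(start + offset in index[word][doc_id]
                   for offset, word in enumerate(words)):
                matches.append(doc_id)
                break
    return matches


positional_index = build_positional_index(docs)
print(phrase_search(positional_index, "row boat"))  # [2]
print(phrase_search(positional_index, "row row"))   # [0]
```

Doc 1 contains both words but is rejected in step 2, because "boat" is two positions after "row" rather than one.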
Determining Query or Filter
Elasticsearch provides both querying and filtering. Filtering is faster, cacheable and returns boolean results. Queries, by contrast, are slower, are not cacheable and return a fuzzy relevance score as the result. A simple rule of thumb is that a query should be used only when it is a must.
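As a rough illustration of the difference, the sketch below builds three search bodies in the Elasticsearch query DSL: one that only filters, one that only scores, and one that combines both. The field names and values (`status`, `merchant_id`, `store_name`) are hypothetical, and the small helper is just a local check, not part of any Elasticsearch client.

```python
import json

# Filter context: each clause is a yes/no test, no relevance score is computed.
# Field names and values are assumptions made for this example.
filter_body = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"status": "approved"}},
                {"terms": {"merchant_id": ["m-001", "m-002"]}},
            ]
        }
    }
}

# Query context: the match clause computes a relevance score per document.
query_body = {
    "query": {
        "match": {"store_name": {"query": "ramen shop", "fuzziness": "AUTO"}}
    }
}

# Combined: score only on the text field, filter on everything else.
combined_body = {
    "query": {
        "bool": {
            "must": [{"match": {"store_name": "ramen"}}],
            "filter": [{"term": {"status": "approved"}}],
        }
    }
}


def scoring_clauses(body):
    """Return the clauses of a search body that contribute to the score."""
    query = body["query"]
    if "bool" not in query:
        return [query]  # a bare query such as match runs in query context
    bool_query = query["bool"]
    return bool_query.get("must", []) + bool_query.get("should", [])


for name, body in [("filter", filter_body),
                   ("query", query_body),
                   ("combined", combined_body)]:
    print(name, "scoring clauses:", len(scoring_clauses(body)))
    print(json.dumps(body, indent=2))
```

Keeping exact-match conditions under `filter` and reserving `must` for the text that genuinely needs ranking follows the rule of thumb above.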
Elasticsearch at PayPay
At PayPay, in the Merchant Team, we used the filtering method to search application information for merchant registration. Why? Because the search query always included a set of specific IDs and field names, so a fuzzy query was irrelevant and more time-consuming in this scenario.
We also used ES for the store and menu search functionality of the PayPay pickup service, where we tuned the search to return accurate results for Japanese queries.
But how we achieved that is a long story for another blog.