System optimization is paramount. There are many techniques for achieving it, but the goal is always the same: scalability, performance, and reliability. In this article, we shall analyze an architecture that I came up with and identify the problems that we need to solve using OTP. It consists of two services: Mbanking simulates a banking system, and Mgate simulates a payment gateway. This architecture is very common in fintech. We shall break this optimization into three parts, and this article focuses on identifying the problems that we need to fix. Please do not optimize prematurely, without any measurements.
Mbanking Architecture
As you can see from the swimlane diagram, this service is responsible for accepting payment requests and processing them asynchronously. The client, which is Mgate, has to get the status of a request by calling a different API. This service was built using Nest.js and PostgreSQL, and you can find the source code here. We shall assume that this is a third-party system that we don’t have control over, so we shall be making our changes to the Mgate service.
Mgate Architecture
Mgate receives payment requests from a client and forwards them to the third-party service. As you can see, the client only receives a response after the third-party service has responded. This is one of the problems that we need to fix, and we shall see why in a minute. This service was built using Elixir/Phoenix and PostgreSQL. Please find the source code here.
The Problem
You might be asking yourself: if this design works (it actually works, guys), then what is the issue? To find out, we need to understand how the Mbanking service works, as illustrated in the diagram.
When a request comes in, it is validated, saved to the database, and added to the queue. The client gets a response with the status accepted. The background process then reaches out to the customer to authorize the payment by entering a One Time Password (OTP). Please do not confuse this OTP with the Open Telecom Platform. Because this requires external input from a human, it can take quite some time. When the user authorizes, the payment processing is complete and the gateway can request the status of this transaction via another API. Most fintech applications also provide a callback/webhook that is triggered when the payment operation is done (sorry, I did not implement this). Anyway, with this kind of design, the Mbanking service is able to handle a high volume of requests by processing them concurrently. This leads to high responsiveness, better resource utilization, and increased throughput.
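From the gateway's side, this flow boils down to "submit, get accepted, then ask again later". Here is a minimal sketch of that polling loop in Elixir. The module name `StatusPoller`, the status atoms, and the `fetch_status` function are all hypothetical; `fetch_status` stands in for the real HTTP call to the Mbanking status API, which is not shown in this article.

```elixir
defmodule StatusPoller do
  @moduledoc """
  Sketch only: polls a status function until the payment leaves the
  `:accepted` state or we run out of attempts.
  """

  # `fetch_status` is a zero-arity function standing in for the real
  # call to the Mbanking status API (an assumption for this example).
  def poll(fetch_status, attempts \\ 5, interval_ms \\ 1_000)

  # No attempts left: give up instead of waiting forever.
  def poll(_fetch_status, 0, _interval_ms), do: {:error, :timeout}

  def poll(fetch_status, attempts, interval_ms) when attempts > 0 do
    case fetch_status.() do
      # Still waiting for the customer to enter the one-time password.
      :accepted ->
        Process.sleep(interval_ms)
        poll(fetch_status, attempts - 1, interval_ms)

      # Any other status is treated as final in this sketch.
      final ->
        {:ok, final}
    end
  end
end
```

Capping the number of attempts matters here because the customer may never enter the one-time password, and a poller without a limit would wait indefinitely.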
On the other hand, Mgate makes a synchronous call to the Mbanking service and blocks until the Mbanking service returns a response. This is not a good design, especially when we know that the immediate response from the Mbanking service is only an acknowledgement and not a final result (i.e. the payment has not been processed yet). When the Mbanking service is down for some reason unknown to Mgate, the request is left hanging in the Mgate database and the client needs to initiate another request. This is not a good user experience. At the very least, Mgate should be able to retry or fail the request. As a result of all this, Mgate cannot process a high volume of requests, which is usually the job of a gateway service. So, here is the list of glaring issues with this design:
Blocking calls – Mgate makes synchronous calls to Mbanking and blocks until it gets a response. This limits the system’s ability to handle a high volume of requests concurrently.
Single point of failure – If the Mbanking service is down, Mgate will always return an error.
Poor user experience – Clients have to retry their requests manually if the Mbanking service is down or slow, which leads to frustration.
Lost requests – When Mgate is not able to reach the Mbanking service, it has already saved the record before it errors out, leading to incomplete or lost requests. This is not acceptable.
Scalability issues – The current design of Mgate does not scale well under high load, which limits the number of concurrent requests it can serve.
Resource management – Resources are used inefficiently because of the blocking calls and the poor handling of asynchronous work.
Idempotency issues – An idempotent operation is one that guarantees the same result when it is performed multiple times with the same input. Mgate's payment request is not idempotent: a client that retries after a failure initiates a brand new request while the record from the first attempt is still sitting in the Mgate database.
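To make the idempotency point concrete, here is a minimal sketch of one common approach: deduplicating on a key supplied with each request. This is not how Mgate currently works. The module `IdempotencyStore`, the table name, and the idea that every request carries an idempotency key are assumptions made for illustration. The sketch relies on `:ets.insert_new/2`, which inserts a row only when the key is not already present.

```elixir
defmodule IdempotencyStore do
  # Hypothetical table name, chosen for this example.
  @table :mgate_idempotency_keys

  # Creates a named, public ETS set. Call this once at startup.
  def init do
    :ets.new(@table, [:set, :public, :named_table])
    :ok
  end

  # Runs `fun` only the first time `key` is seen.
  # A repeated key gets the stored result back instead of a second run.
  def run_once(key, fun) when is_function(fun, 0) do
    # insert_new/2 is atomic, so only one caller can claim a given key.
    if :ets.insert_new(@table, {key, :pending}) do
      result = fun.()
      :ets.insert(@table, {key, {:done, result}})
      {:ok, result}
    else
      case :ets.lookup(@table, key) do
        [{^key, {:done, result}}] -> {:duplicate, result}
        [{^key, :pending}] -> {:duplicate, :in_progress}
      end
    end
  end
end
```

Two caveats apply to this sketch. The table belongs to the process that calls `init/0` and disappears when that process exits, so a real application would have a long-lived process own it. Also, if `fun` raises, the key stays `:pending`, and a production version would need to clean that up.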
How can we fix this?
The fix calls for asynchronous processing, efficient resource management, a good user experience, scalability, and high throughput, just to name a few. Mgate can achieve this by taking advantage of Elixir’s inherent features and ecosystem. Elixir has a concurrency model that is built on the Erlang VM (BEAM). We are going to use some of these features to improve the performance of Mgate. The diagram below illustrates our goal. We shall use Tasks, GenServers, ETS tables, Supervisors, and Registry. All of this will be explained in part 2 of this article. We shall also try to explain why we chose each technology along the way.
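As a rough preview, and not the implementation from part 2, here is a minimal sketch of the core idea: reply to the client right away and forward the request in a supervised task that retries on failure. The module `AsyncForwarder`, the request shape (a map with an `:id`), the retry count, and the `forward` function are all hypothetical; `forward` stands in for the HTTP call to Mbanking.

```elixir
defmodule AsyncForwarder do
  @sup AsyncForwarder.TaskSup
  @retry_delay_ms 50

  # Starts a Task.Supervisor. In a Phoenix app this would normally
  # live in the application's supervision tree instead.
  def start_link do
    Task.Supervisor.start_link(name: @sup)
  end

  # Returns immediately; forwarding happens in a separate process.
  # `forward` is a one-arity function standing in for the call to
  # Mbanking (an assumption for this example).
  def accept(%{id: id} = request, forward, retries \\ 3) do
    {:ok, _pid} =
      Task.Supervisor.start_child(@sup, fn ->
        forward_with_retry(request, forward, retries)
      end)

    {:accepted, id}
  end

  # Out of attempts: a real version would mark the request as failed
  # here, so that it is not left hanging in the database.
  def forward_with_retry(_request, _forward, 0), do: {:error, :retries_exhausted}

  def forward_with_retry(request, forward, retries) when retries > 0 do
    case forward.(request) do
      {:ok, response} ->
        {:ok, response}

      {:error, _reason} ->
        # Fixed delay for brevity; backoff would be the usual choice.
        Process.sleep(@retry_delay_ms)
        forward_with_retry(request, forward, retries - 1)
    end
  end
end
```

With this shape, a slow or unavailable Mbanking service no longer holds the client's request open, and a failed forward ends in an explicit retry or failure rather than a hanging record.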
Is this all we have?
There is never a silver bullet for anything. Performance optimization is an incremental process that takes time. With Elixir/Phoenix, the heavy lifting has already been done for us, and all we need to do is use it. We shall explore this in part 2 of this article.