<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Klarna Engineering - Medium]]></title>
        <description><![CDATA[Disrupting the financial sector starts and ends with products that work, are easy to use and stable day after day. The Engineering competence is pivotal in creating, maintaining and developing the Klarna experience. - Medium]]></description>
        <link>https://engineering.klarna.com?source=rss----86090d14ab52---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Klarna Engineering - Medium</title>
            <link>https://engineering.klarna.com?source=rss----86090d14ab52---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Thu, 16 Apr 2026 12:19:59 GMT</lastBuildDate>
        <atom:link href="https://engineering.klarna.com/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Beyond Prompting: How Algorithmic Evolution Doubled our Training Speed]]></title>
            <link>https://engineering.klarna.com/beyond-prompting-how-algorithmic-evolution-doubled-our-training-speed-8f874af3080d?source=rss----86090d14ab52---4</link>
            <guid isPermaLink="false">https://medium.com/p/8f874af3080d</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Rex Lin]]></dc:creator>
            <pubDate>Mon, 30 Mar 2026 12:47:12 GMT</pubDate>
            <atom:updated>2026-03-30T13:23:31.718Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*u7RbgujVkRJlJUfkC4U_jQ.jpeg" /></figure><p>By <a href="https://medium.com/u/874312cffcd2">Rex Lin</a> and <a href="https://medium.com/u/ca8844e068e7">Valeria Verzi</a> (Klarna Engineering), with <a href="https://medium.com/u/cb93502b860d">Anant Nawalgaria</a> (Google)</p><p>We knew our training pipeline could be faster — significantly faster.</p><p>At Klarna, one of our largest models (a transformer trained on vast streams of payment events and shared like infrastructure across many internal systems) is on a tight training loop. Speed is money at our scale. The opportunity for improvement was not in the hyperparameters, but in the plumbing: the way numbers moved between processors, the way memory was allocated, the way the model performed its most basic mathematical operations.</p><p>The challenge was scale. An engineer might try five or ten structural rewrites. A particularly ambitious one, armed with an AI coding assistant, might push to a hundred. But the full search space, the universe of possible combinations of precision formats, data pipelines, attention mechanisms, and gradient strategies, numbered in the thousands.</p><p>We partnered with Google to apply a different kind of tool. Instead of trying to prompt our way to a solution, we handed the problem to AlphaEvolve, which treats code optimization the way evolution treats organisms: generate candidates, test them, keep the fittest, repeat. Over three weeks and nearly 6,000 candidate programs, it doubled our training speed and, unexpectedly, produced a better model in the process.</p><h3>The Machine That Writes Machines</h3><p>AlphaEvolve is not a chatbot for code. You don’t interact with it through prompts. 
Instead, you build a sandbox around it.</p><p>The engineer’s job is to define what can change in the code and what cannot (passing only structural code snippets to the system, never customer data), to specify the metric that matters, and to set the constraints that must never be violated. You write hints in the form of code comments. You craft error messages that help the system learn when it fails. Then you step back.</p><p>The system takes over from there. It generates a candidate program, runs it, measures the result, scores it. The best candidates survive and seed the next generation. Candidates that violate constraints or fall below the quality threshold are discarded. This happens thousands of times, with no human reviewing individual outputs. The engineer shapes the environment; the machine explores it at a scale no person could match.</p><h3>Why Speed Matters When You Train a Thousand Times</h3><p>Scaling one of our largest models is not theoretical; it’s a daily challenge. With over 114 million customers and more than 3.4 million transactions each day, the data we process is constantly growing.</p><p>We run our training pipeline hundreds to thousands of times, depending on the stage of scaling, across different configurations and dataset sizes. Every minute shaved off a single run compounds across hundreds of iterations, directly reducing cloud-computing costs and accelerating development cycles.</p><p>The question was whether the code itself could be structurally reorganized to run faster, without sacrificing the model’s predictive accuracy or violating our strict requirement that every training run be perfectly reproducible. In regulated financial services, if you can’t reproduce the exact result, you can’t audit it, and you can’t deploy it.</p><h3>What Evolution Discovered</h3><p>The early generations found the obvious wins.</p><p>Mixed-precision training. Asynchronous data transfers between CPU and GPU. 
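</p><p>The generate–evaluate–select loop at the heart of this process can be sketched in miniature. The snippet below is a toy illustration only, not AlphaEvolve: the “program” is a single number, and the hypothetical <code>fitness</code> and <code>mutate</code> functions stand in for running and scoring a real candidate.</p>

```python
import random

def evolve(seed, fitness, mutate, generations=1000, population=20, floor=float("-inf")):
    """Toy evolutionary search: mutate the current best candidate,
    keep the child only if it clears the quality floor, and retain
    the fittest `population` candidates for the next round."""
    pool = [seed]
    for _ in range(generations):
        parent = max(pool, key=fitness)
        child = mutate(parent)
        if fitness(child) >= floor:  # constraint / quality gate
            pool.append(child)
        pool = sorted(pool, key=fitness, reverse=True)[:population]
    return max(pool, key=fitness)

# Toy usage: a "program" is just a number; fitness rewards closeness to 42.
best = evolve(
    seed=0.0,
    fitness=lambda x: -abs(x - 42.0),
    mutate=lambda x: x + random.uniform(-1.0, 1.0),
)
```

<p>The real system differs in every detail (candidates are programs, evaluation is a full training run, and selection maintains a diverse population), but the control flow is this loop run thousands of times.</p><p>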
It instantiated high-precision tracking variables directly on the device, accumulating metrics entirely on the GPU without ever talking to the CPU and pulling the final result back only once at the end of the epoch, eliminating microscopic synchronization stalls. These are optimizations a seasoned engineer would eventually try, and they pushed throughput from 49 to about 72 samples per second.</p><p>Then we made a change that altered the trajectory of the search: we opened up more of the codebase. Initially, only the training loop was marked as evolvable. When we exposed the model’s forward pass and the data pipeline, AlphaEvolve began making deeper structural rewrites.</p><p>It stripped away many of the framework’s abstractions. It discarded PyTorch’s default Transformer objects, replacing them with a direct loop calling scaled dot-product attention kernels. By doing so, it realized it no longer needed to manually cache causal masks or pass dummy tensors: it let the bare-metal C++ kernels handle it.</p><p>It applied the same approach to the data pipeline. Instead of a standard setup where the data loader creates and pads new tensors for every batch, it pre-allocated the entire dataset into a single, massive, contiguous block of memory before training even started. It turned a complex data loader into a raw memory stream where batches were just contiguous memory slices.</p><p>These are cross-cutting optimizations. Getting them right requires reasoning about numerical precision, memory layout, and computation simultaneously, the kind of optimization that is hard for humans not because the individual pieces are complex, but because you have to hold all of them in your head at once.</p><p>Combined, the changes brought throughput to roughly 97 samples per second under deterministic production constraints. That is roughly double the baseline.</p><p>Model quality improved, too. 
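</p><p>To make the synchronization stalls concrete, here is a minimal PyTorch sketch of device-side metric accumulation in the spirit of the rewrite described above. It is illustrative only, not Klarna’s actual training loop; the model, optimizer and loss are placeholders.</p>

```python
import torch

def train_epoch(model, opt, batches, device="cpu"):
    # The running loss lives on the device: no per-step .item() calls,
    # so the host never forces a device synchronization mid-epoch.
    total = torch.zeros((), device=device, dtype=torch.float64)
    for x, y in batches:
        opt.zero_grad(set_to_none=True)
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
        total += loss.detach()  # accumulates on-device
    # A single device-to-host transfer, once, at the end of the epoch.
    return (total / len(batches)).item()
```

<p>Compared with calling <code>loss.item()</code> every step, which blocks the host until the device catches up, this defers the only transfer to the end of the epoch.</p><p>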
The best programs ran twice as fast while producing measurably better predictions than the baseline. The ultimate value of this search wasn’t necessarily just reducing costs (evolutionary searches themselves are computationally expensive). The value was in time. When a production pipeline runs in half the time, the speed of engineering doubles.</p><h3>The Solution That Fired Itself</h3><p>Perhaps the most striking result was something the system did to its own work.</p><p>At a training scale of 10,000 customer samples, evolution designed an elaborate stability mechanism that it named (unprompted) PASC (Proactive Adaptive Stability Control). It wasn’t just a simple failure check. PASC introduced a dynamic threshold to monitor the raw, unscaled loss during gradient accumulation. If a single mini-batch spiked, PASC flagged the cycle and waited; if the optimizer later detected unstable gradients, it executed an aggressive reset, dropping the gradients to save the training run. Some evolved versions even adjusted their aggressiveness based on the smoothed exponential moving average of the loss. At that scale, PASC appeared in every top-performing program.</p><p>Then we scaled up to 100,000 customer samples, still far from production scale, but enough to change the regime. Evolution dropped it. Simpler programs with fixed settings outperformed, revealing a broader pattern: mechanisms that stabilize small-scale training can become pure overhead as scale increases.</p><p>The system had not just found optimizations. It had found that one of its own best ideas was no longer needed, and discarded it. It arrived at the same lesson engineers typically learn over years of experience: that complexity is not always an asset.</p><h3>Defining the Right Sandbox</h3><p>The project required strict constraint engineering.</p><p>We did not enforce our determinism requirement from the start. 
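</p><p>For orientation, determinism in a PyTorch stack of this kind comes down to a handful of global switches. The sketch below shows both regimes; it is an illustration using PyTorch’s documented reproducibility controls, not our production configuration.</p>

```python
import torch

def fast_but_nondeterministic():
    # The kind of hardware shortcuts an unconstrained search can exploit:
    torch.backends.cuda.matmul.allow_tf32 = True  # TensorFloat-32 matmuls
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cudnn.benchmark = True         # autotuned, non-reproducible kernels

def strict_determinism():
    # The regime required for auditable, bit-reproducible training runs:
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.use_deterministic_algorithms(True)
```

<p>Note that full run-to-run reproducibility on GPUs typically also involves fixed seeds and, on some CUDA versions, the <code>CUBLAS_WORKSPACE_CONFIG</code> environment variable.</p><p>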
The single largest throughput gain we found — a leap from 72 to 143 samples per second — turned out to rely on non-deterministic hardware shortcuts, aggressively enabling TensorFloat-32 and cuDNN benchmarking. Multiple runs had been spent exploring a path that could never be deployed. But this constraint ultimately became a catalyst. Once we forced the system into strict determinism, shutting off the hardware-level shortcuts, AlphaEvolve was forced to invent the architectural rewrites (the memory stream and the bare-metal attention bypass) to claw its way back up to 97 samples per second.</p><p>We also learned the importance of early stopping. Our largest experiment logged 631 consecutive evaluations with no improvement. Throughput had effectively plateaued. In hindsight, without a strict plateau threshold, the system would simply continue exploring marginal paths.</p><p>The choice of underlying language model mattered more than we expected. Early runs used an older model that produced syntactically broken or crashing programs roughly two-thirds of the time. Switching to a newer model pushed the success rate to between 86 and 97 percent, transforming the economics of the search.</p><h3>Beyond the Training Loop</h3><p>At the scale we operate — millions of customers and continuously growing datasets — the throughput improvements from this single project translate into a significant compounding effect on iteration speed and compute efficiency. And that was just one training loop for one model.</p><p>We see the same pattern elsewhere in our business: a clearly defined objective, hard constraints, and a search space of code-level decisions too large for any human to explore. We are now experimenting with AlphaEvolve on other high-stakes modeling and simulation problems where constraints are tight and the implementation details dominate runtime.</p><p>The broader implication extends beyond our work. 
For years, the conversation around AI and software engineering has centered on generation — what can the machine write when you tell it what you want? This project suggests a different, and possibly more consequential, question.</p><p>The future isn’t just what you can prompt an AI to write. It is what algorithmic evolution discovers when you let machines continuously redefine the limits of your work.</p><p><strong><em>Acknowledgement</em></strong></p><p><em>This project was a collaboration between the Klarna team, including Rex Lin, Valeria Verzi, Goran Dizdarevic and Sudhakar Periyasamy; the AI for Science team at Google Cloud, including Kartik Sanu, Laurynas Tamulevičius, Nicolas Stroppa, Chris Page, Skandar Hannachi, Vishal Agarwal, John Semerdjian, and Anant Nawalgaria; our partners within the Google Cloud account team, Nikola Rubil and Karl Helling; and Google DeepMind.</em></p><p>Follow <a href="https://engineering.klarna.com">Klarna Engineering</a> for more.</p><p>AlphaEvolve is developed by Google DeepMind. Learn more at <a href="https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/">deepmind.google</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8f874af3080d" width="1" height="1" alt=""><hr><p><a href="https://engineering.klarna.com/beyond-prompting-how-algorithmic-evolution-doubled-our-training-speed-8f874af3080d">Beyond Prompting: How Algorithmic Evolution Doubled our Training Speed</a> was originally published in <a href="https://engineering.klarna.com">Klarna Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How I stopped worrying and learned to love Cloud Inventory]]></title>
            <link>https://engineering.klarna.com/how-i-stopped-worrying-and-learned-to-love-cloud-inventory-723cd3c49d46?source=rss----86090d14ab52---4</link>
            <guid isPermaLink="false">https://medium.com/p/723cd3c49d46</guid>
            <category><![CDATA[engineering-management]]></category>
            <category><![CDATA[cloud-inventory-software]]></category>
            <category><![CDATA[change-management]]></category>
            <category><![CDATA[technical-change]]></category>
            <dc:creator><![CDATA[Maxim Savin]]></dc:creator>
            <pubDate>Fri, 06 Jun 2025 06:38:06 GMT</pubDate>
            <atom:updated>2025-06-06T07:51:47.682Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*30rilqotOLrVUbOA_wmYHQ.jpeg" /></figure><p>A long time ago, as a punishment for his crimes, Hades, the king of the underworld, made Sisyphus roll a huge enchanted boulder endlessly up a steep hill. Since then, many tech companies have learned to do the same at scale, through the hardships of cloud configuration management.</p><p>Consider an Engineer who wants to ensure that the data that moves through their system is encrypted along the way. This is a noble goal, and to achieve it they must identify every classic load balancer in their AWS environment and replace it with an application load balancer that enforces encryption in transit. Now imagine doing that at the scale of a company like Klarna, where teams collectively own more than a thousand AWS accounts. Add to this a multitude of other configuration challenges — databases that have not been deployed in a multi-availability-zone set-up, missing CloudWatch logs, expired digital certificates, systems running on unsupported framework versions — the list is endless. Identifying and rectifying violating cloud assets often feels like an endless game of whack-a-mole played blindfolded. This is the steep price tech companies pay to operate their systems securely and confidently, day by day.</p><p><strong><em>Klarna Engineering Platform (KEP)</em></strong> has been on a mission to facilitate configuration management for Klarna Engineers. After a few iterations we have built an ecosystem of Klarna services designed to collect, normalize, map, and serve data on ICT assets within Klarna’s cloud infrastructure. 
We call this system <strong><em>Cloud Inventory</em></strong>.</p><p>Over the last few months Klarna has:</p><ul><li>Rolled out over 100 automated controls enhancing every aspect of our configuration management (security, governance, and operational excellence), each aimed at helping system owners identify and fix violations quickly.</li><li>Reduced the lead time of rolling out a control from several weeks to a matter of minutes.</li><li>As a result, successfully completed several large-scale cloud infrastructure optimization projects, such as a company-wide clean-up of RDS and EBS snapshots, without any incidents.</li></ul><p>In what follows, I will walk you through the foundation and the building blocks of <strong><em>Cloud Inventory</em></strong> as well as share some of the use cases it enables.</p><p><strong>In the beginning there was a SystemID</strong></p><p>It is often the simplest statements that are most important. Let’s begin with a fundamental statement about configuration management that all of Klarna’s technology rests on: <strong><em>every system at Klarna must have a SystemID</em></strong>. A SystemID is a unique identifier of a system. It is hard to give a “scientific” definition of what the scope for a given SystemID should be, but we follow some guiding principles:</p><ul><li>A SystemID should not cover multiple things that are naturally handled by different teams</li><li>A SystemID should cover things that are logically developed as a unit and interact through non-published APIs, loosely defined as a “code base”</li></ul><p>A SystemID is required in several key processes of the software development lifecycle. For example, without it, one cannot create an access group or set up an AWS account. This simple yet powerful concept helps to define the boundaries of a system. Klarna maintains a dedicated systems registry in order to store and manage the lifecycle of SystemIDs. 
In the context of Cloud Inventory, System Owners are required to tag their cloud assets with SystemID tags (preferably by configuration), and this enables <strong><em>everything</em></strong> — ownership, accountability, governance.</p><p><strong>Cloud Inventory</strong></p><p>Now, onto the main secret, which is quite simple! As the whole Universe is based on graph theory, where love is an edge, it is no wonder graphs are so helpful in configuration management. We employ a graph database solution to build a model of our cloud inventory featuring teams, systems and related ICT assets. For example, the graph below represents a system, “Klarna App”, along with its assets and dependencies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PJJjcezSvOEpnXEv" /><figcaption>Pic 1. A system and its assets represented in a graph</figcaption></figure><p>This representation includes both cloud resources owned by the system — such as instances, databases, security groups, load balancers, VPC endpoints — and ICT assets external to our cloud providers, such as artifact repositories, threat models, user access groups, experimentation platform features, Kafka topics, and more. The most valuable aspect is that all these assets, residing in various siloed sources of truth across the company, are unified into <strong><em>a single graph with established relationships</em></strong> — all thanks to the little thing called SystemID. This also explains why we do not tend to rely on AWS-specific solutions like AWS Config — they are severely limited, since they cover only AWS data and lack context when it comes to change management (more on that below).</p><p>How do we achieve this? There are two main components to Cloud Inventory:</p><ul><li>A graph database. 
We currently use a <a href="https://www.jupiterone.com/">third-party platform, <em>JupiterOne</em></a>, which provides us with a database that has direct integrations with our cloud providers. The database acts like a “giant cache” of Klarna’s cloud inventory, fully refreshed every few hours.</li><li>A set of ETL jobs. We have built synchronization jobs that run at specified polling intervals to set up custom integrations with our internal sources of truth — such as organizational data and the systems registry. These integrations are established through dedicated S3 buckets with distributed ownership, where different teams own various parts of the inventory.</li></ul><p>The central question that the Cloud Inventory solution enables us to answer is: <strong><em>“What is the current state of what we own as a company in the cloud?”</em></strong> With the support of the graph query language, we can break down this question into specific queries, obtaining granular-level answers.</p><p>Below, I will present a use case illustrating how we assist Klarna system owners in optimizing their inventory — leading to operational cost savings and fewer incidents.</p><p><strong>So what is a target spec?</strong></p><p>Every engineer’s goal is to make sure their systems run reliably, securely and efficiently. Suppose a Klarna team runs a system with an RDS instance. This team has a dashboard displaying reports, such as one stating that <strong><em>“All production databases must be enabled for multiple availability zones”</em></strong>. Teams are expected to act on these reports within a specified SLA.</p><p>We call these reports “<strong><em>target specifications</em></strong>” because they define the target state we expect an asset to achieve. 
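</p><p>Conceptually, evaluating a target specification is just a query over the graph. In miniature, with hypothetical asset records and field names rather than our real inventory schema:</p>

```python
def violations(assets):
    """Assets violating 'all production databases must be Multi-AZ'."""
    return [
        a["id"]
        for a in assets
        if a["type"] == "rds_instance"
        and a["tags"].get("Environment") == "production"
        and not a.get("multi_az", False)
    ]

# Hypothetical inventory records, purely for illustration:
assets = [
    {"id": "db-1", "type": "rds_instance", "tags": {"Environment": "production"}, "multi_az": True},
    {"id": "db-2", "type": "rds_instance", "tags": {"Environment": "production"}, "multi_az": False},
    {"id": "db-3", "type": "rds_instance", "tags": {"Environment": "staging"}},
]
```

<p>A real target spec expresses the same predicate in the graph query language and runs on a schedule, but the shape is the same: a filter that returns the exact violating configuration items.</p><p>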
Every target specification has several qualities:</p><ul><li>It has context — every target specification has a short, concise and relatively human-friendly description of what is expected of the team and why it is important to address it.</li><li>Evidence collection is automated — scoring whether an asset meets the expectation or deviates from it is expressed as a simple query.</li><li>Evidence is precise and verifiable — evidence of an asset meeting or deviating from the expectation is tied to the exact configuration item (e.g. the exact database instance).</li><li>It is actionable — it explains how to achieve the target in a user-friendly way. The goal is that an Engineer should have everything they need to act on a target spec just by glancing at it.</li><li>It is configured as code. Every target specification is essentially an <em>.md</em> file in a repository, meaning that anyone with basic Git knowledge can contribute and propose changes to both asset identification methods and the instructions provided to asset owners.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nCQ_OGH7ODRPtlc6pC1AAw.png" /></figure><p>As you’ve probably noticed, target specifications are just organized and scheduled queries to the graph database, identifying deviations from the target state and exposing them to system owners.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BrU5WgszqDFYEwmu" /><figcaption>Pic 3. What a target spec looks like in the repo</figcaption></figure><p>And what about impact? In the past few months, Klarna has released more than 100 target specifications covering nearly every aspect of configuration management — from database operations and tagging to logs and encryption. Target specifications have scaled well and are on track to fully replace the manual surveys and reviews previously used for governance. They have proven highly effective. 
For instance, in enabling Multi-AZ for RDS instances, the number of identified violations has halved and continues to decline steadily.</p><p>Another straightforward example is a target spec for untagged database snapshots older than 90 days. At the time of its introduction, Klarna had roughly 1,500 such snapshots in its cloud inventory. One month later, this number had dropped to about 500 as system owners proactively reduced waste, resulting in €10k–€20k in monthly savings on the AWS bill from a single target spec.</p><p>Target specifications powered by Cloud Inventory have significantly enhanced Klarna’s security, reliability, and performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FTMB90RF40RtHpzS" /><figcaption>Pic 4. A steady decline in the number of violating assets after activation</figcaption></figure><p><strong>Lessons Learned</strong></p><p>We’ve been working on Cloud Inventory for a bit more than a year, and here are some of the learnings so far:</p><ul><li><strong>Ownership matters.</strong> Since Cloud Inventory touches nearly every source of truth within the company, it was crucial to define the owners of the running integrations and of the data in the graph. Clear ownership is key to successful scaling and incident resolution; without it, things are doomed to deteriorate over time.</li><li>Change management starts with being <strong>clear on expectations</strong>. The set format of target specifications forced us to formulate expectations in a concise and clear manner, increasing the likelihood of teams acting on violations. We also learned that even small changes with a limited scope are best conducted via target specs, as they scale very gracefully and teams benefit from having every kind of change request on a single dashboard.</li><li><strong>Focus on the data model</strong> when dealing with graphs. We quickly learned that the complexity of the AWS integration alone is substantial. 
Scaling Cloud Inventory to 20+ different integrations was an opportunity to introduce chaos. Therefore, we dedicated considerable time to aligning data standards (e.g., entity classes and types) and naming conventions to ensure Cloud Inventory scaled gracefully.</li><li><strong>Simplification is key.</strong> Overall, in the case of target specifications, we opted for a very straightforward, almost “one-size-fits-all” model targeting every type of asset, and it has suited us very well.</li><li><strong>Show value early.</strong> Demonstrating immediate, direct value was essential to convince our customers within Klarna to integrate their data sources with Cloud Inventory. Initially, the central team drove the integration process by selecting and implementing sync jobs with the most relevant data sources. However, after a few successful cases proved the value of established integrations (such as the ability to set up relevant target specs), we began to see data source owners proactively proposing integrations. We’ve transitioned from a “supplier push” to a “customer pull” in our rollout.</li></ul><p><strong>Next steps</strong></p><p>Although Cloud Inventory is now an integral part of the Klarna Engineering Platform, our journey is far from over. Our current goals to advance it further include:</p><ul><li><strong>Expanding data sources</strong>. We still have some data sources not yet covered, which we are eager to bring in to enable more transparency and control. Our current focus includes ingesting an inventory of Jenkins instances, team on-call schedules, and code repositories.</li><li><strong>Leveraging AI</strong>. We are actively using Klarna’s internal AI Assistant to help teams validate manual inputs to the graph’s entities. For example, AI is already helping to reason about and cross-check information on system attributes (such as availability, integrity, and confidentiality classifications) and system dependencies. 
In the future, we aim to empower our users with AI-driven support for managing the data model and constructing queries to the graph.</li><li><strong>Making better use of positive evidence</strong>. We aim to expand the perspective of target specifications as automated controls, integrating “good evidence” in addition to violations. This will not only act as positive motivation for system owners but also align with risk control frameworks and facilitate regulatory compliance.</li></ul><p>By continuously improving Cloud Inventory, we strive to set new benchmarks in cloud configuration management, driving operational excellence, and assisting Klarna system owners in making smart choices about their cloud assets. I am excited to see what comes next!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=723cd3c49d46" width="1" height="1" alt=""><hr><p><a href="https://engineering.klarna.com/how-i-stopped-worrying-and-learned-to-love-cloud-inventory-723cd3c49d46">How I stopped worrying and learned to love Cloud Inventory</a> was originally published in <a href="https://engineering.klarna.com">Klarna Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Learnings from a Klarna Engineer on feature development]]></title>
            <link>https://engineering.klarna.com/learnings-from-a-klarna-engineer-on-feature-development-9780c7870f3c?source=rss----86090d14ab52---4</link>
            <guid isPermaLink="false">https://medium.com/p/9780c7870f3c</guid>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[klarna]]></category>
            <category><![CDATA[learning-to-code]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[software-engineer]]></category>
            <dc:creator><![CDATA[Julien Avezou]]></dc:creator>
            <pubDate>Mon, 14 Apr 2025 09:05:53 GMT</pubDate>
            <atom:updated>2025-04-14T09:05:53.398Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lgqElr8phgb6uLYQxwqSOg.jpeg" /></figure><p><em>In the world of FinTech, where regulations and innovation collide, my team at Klarna implemented new ways for users to recover access to their Klarna Card and Klarna balance accounts on their device. This posed a series of challenges, with aspects to consider ranging from product, usability and security to scalability and regulatory compliance. With the feature now released, I would like to share key learnings from working on this complex feature.</em></p><p><strong>Documentation is Key</strong></p><p>Starting to document early on in the project is crucial. We generated documents at all stages of our project in order to gather feedback at a technical, product and design level. This helped in aligning with teams, enabling smooth collaboration, and setting a solid foundation for feature development and delivery.</p><p>These documents allow close and continuous collaboration between stakeholders from product, technical and design perspectives, from the early stages of feature discovery to the start of implementation. However, documentation doesn’t stop there: once the implementation is done, it is equally important to capture your feature in writing from both product and technical aspects, supported by architectural diagrams and swimlanes, so that stakeholders within the company have a clear reference when interacting with the feature in the future.</p><p><strong>Always communicate</strong></p><p>Establish open communication channels, such as dedicated Slack channels for internal stakeholders, and provide regular updates in order to foster collaboration. Promptly raising blockers and organizing group discussions with colleagues from various competences ensures transparency and helps in addressing challenges effectively. 
Having an open channel to discuss and update on progress also helped streamline our communication and avoid unnecessary meetings. It also serves as an accountability mechanism and a great way to keep track of conversations and topics over time. In addition, a dedicated channel provides an easy way to share documents and highlight the important ones for quick and easy access by adding them to the channel bookmarks. Our team saved countless hours of meetings by setting up and maintaining a dedicated Slack channel from the start.</p><p><strong>Squash biases with Bug Bashes</strong></p><p>A Bug Bash is a structured testing event where team members and stakeholders come together to identify software bugs collaboratively, based on a set of instructions and guidelines. A Bug Bash serves as a valuable session to spot hard-to-find bugs resulting from tunnel vision and owner bias. By leveraging the diverse talents within the company and inviting representatives from various stakeholder and non-stakeholder groups, we identified critical issues early in the process, enhancing overall product quality. For example, one stakeholder tested a flow in the app that their team owns as an integrator, which we hadn’t covered in our own testing. This led to the discovery of a major bug that we fixed in time before the release.</p><p><strong>Beyond Engineering, think Product</strong></p><p>Combining a product and ownership mindset with engineering keeps development efforts aligned with the product roadmap. Understanding user needs and product goals ensures that engineering decisions are driven by a customer-centric approach. By taking ownership of the product, we organized regular demos with stakeholders. These touchpoints were extremely valuable to us, keeping us accountable while also getting early feedback in the process. 
A better understanding of the product also allowed us to better grasp the potential value and impact of our feature and how to measure that impact after the release, in terms of metrics and KPIs. With this mentality, we also better understood the needs of all kinds of stakeholders, including internal ones. We baked in mechanisms for easier internal testing and for improving developer experience within the company as a whole, an area often overlooked when building new features.</p><p><strong>Pragmatism in Execution</strong></p><p>Striking a balance between quality and efficiency is vital. Having a clear implementation strategy, understanding the release process, and being accountable for milestones helps in delivering high-quality features within set timelines. As an example, we started with the in-app screens while the technical discovery was still ongoing, so that we already had screens to work with when adding the logic. We also decided to back our models with in-memory databases at first, only persisting the data once we were fully confident in our data handling and in the relationships between the different models; this allowed us to move faster and avoid unnecessary database migrations later on.</p><p><strong>Teach, Support, Learn, Repeat</strong></p><p>Understand early on your own strengths and weaknesses and how they fit within the dynamics of the team. This way you can optimize the time spent by supporting the rest of the team with what you already know while also sharing this knowledge by teaching others. This can be done via pair programming sessions, hosting frequent and concise knowledge sharing sessions, and fostering a culture of collaborative code reviews. If everyone shares the same mentality, you have the opportunity to learn from others, so seize the moment to expand your knowledge and improve on your weaknesses and gaps. 
For example, I learned about different backend strategies to encapsulate logic within service functions and learned to work with Redux Toolkit for the first time in the client.</p><p><strong>Be smart about your time management</strong></p><p>To stay focused on the implementation of complex features while managing daily tasks and duties, developing a structured approach is essential. A strategy such as the 3/3/3 method can be particularly effective <em>(popularized by Oliver Burkeman, author of “Four Thousand Weeks: Time Management for Mortals”)</em>. Begin your day with a focused 3-hour deep dive into feature development, followed by accomplishing 3 urgent tasks and then addressing 3 maintenance tasks. This cyclical routine enables you to prioritize your core feature work while ensuring that essential daily tasks are consistently managed. There are days when it is difficult to respect this structure, due to incidents occurring, open investigations or workshops, but this is completely fine. This time management technique only serves as a general path to follow, not as a rigid rule to abide by. It can sometimes be beneficial to break the routine and have a mob programming session or hackathon to dive deeper into a certain topic.</p><p><strong>Don’t overlook code hygiene and maintenance</strong></p><p>Adhering to the revered “Leave the code cleaner than you found it” principle, inspired by the Boy Scout rule outlined in Clean Code, embodies a crucial philosophy for developers. Actively seek opportunities to enhance code quality through refactoring, thereby making it more organized. For instance, our team invested efforts in converting our JavaScript codebase over to TypeScript as files were touched during feature development. We also ensured that existing routes comply with our internal guidelines and software best practices. These represent proactive steps toward maintaining a high standard of code cleanliness. 
To emphasize this point further, our team experimented with dedicating 2 days each month to maintenance work.</p><p><strong>Don’t forget to enjoy the ride</strong></p><p>Last but not least, don’t forget to have fun while learning! Working on a big feature demands a high level of effort and can bring a lot of pressure and stress. Even so, embrace the challenges and be grateful for the opportunities to learn and grow alongside your colleagues. We experienced a major incident following the initial release, which was stressful. However, we transformed this into an opportunity to dig deeper into the existing code to understand and solve the issue, which also led us to solve longer-standing issues.</p><p><em>This article unraveled the intricacies of developing the Account Recovery feature at Klarna, shedding light on the learnings gathered in the process. Embracing these key takeaways helped my team and me pave the way for success in navigating this complex feature within a fast-paced FinTech company such as Klarna. I hope these learnings can also serve you when implementing your own features.</em></p><p><em>Enjoyed the post and want to stay updated on our latest projects and advancements in engineering?</em></p><p><em>Join our Klarna Engineering community on Medium and LinkedIn.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9780c7870f3c" width="1" height="1" alt=""><hr><p><a href="https://engineering.klarna.com/learnings-from-a-klarna-engineer-on-feature-development-9780c7870f3c">Learnings from a Klarna Engineer on feature development</a> was originally published in <a href="https://engineering.klarna.com">Klarna Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How micro should your microservices be?]]></title>
            <link>https://engineering.klarna.com/how-micro-should-your-microservices-be-9ae7507a625c?source=rss----86090d14ab52---4</link>
            <guid isPermaLink="false">https://medium.com/p/9ae7507a625c</guid>
            <category><![CDATA[microservices]]></category>
            <category><![CDATA[software-architecture]]></category>
            <category><![CDATA[klarna]]></category>
            <category><![CDATA[modular-monolith]]></category>
            <dc:creator><![CDATA[Raya Rizk]]></dc:creator>
            <pubDate>Mon, 24 Mar 2025 07:22:27 GMT</pubDate>
            <atom:updated>2025-03-24T07:22:27.194Z</atom:updated>
            <content:encoded><![CDATA[<h3>Our journey towards striking the right balance</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*33z7mFsJHKrzSYblhVwCnw.jpeg" /></figure><p>The debate between monolithic and microservices architectures is a hot topic in software development. While monolithic systems are known for their simplicity and tightly integrated structure, they face challenges with scaling and flexibility. In contrast, microservices offer greater scalability and autonomy in development, but carry complexity in inter-service interactions. In this article, I’ll share our experience in navigating between these two paradigms while working on a recent project at Klarna, exploring different architectural decisions while addressing a fundamental question: what is the optimal granularity for a microservice?</p><h4>From monolith to microservices: the company is growing 🚀</h4><p>A monolithic architecture is often the natural starting point for businesses, serving well initially but revealing its limitations as organizations scale. Klarna was no exception. Like many in the IT industry, the company embraced the microservices paradigm alongside its rapid growth a few years ago.</p><p>In the payments domain, we are focused on offering customers various payment options, allowing them to choose between paying directly, later, over time, or through other tailored methods. This demand for diverse options led each payment method to evolve into a distinct microservice, managed by dedicated teams. Each payment service acts as a key orchestrator in the purchase flow, coordinating with other services to guide customers through the required steps until order completion.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*C1jBI0Tsbrr3QuQy2Qaqvw.png" /></figure><p>This adoption of microservices naturally aligned with the company’s organizational structure, offering team autonomy and the ability to scale services independently. 
Each service was self-contained, with its own database and code residing in a separate repository. By decoupling payment options into distinct services, we gained greater flexibility and enabled faster development cycles, as teams could focus on their respective components.</p><h4>Landing the distributed monolith: a complex setup 🏗️</h4><p>While the different payment options at Klarna serve unique purposes, they share underlying similarities. Regardless of the payment option a user selects, the end goal remains the same: “to pay”. Certain checks and steps are common across all payment options, such as user authentication, funding source collection, and risk assessment.</p><p>Since the microservices were handling similar functionalities, any new requirement had to be implemented across all services. Additionally, changes or maintenance needed to be executed multiple times by different teams. This led to numerous instances of duplicated code, challenging maintenance, scalability issues, and the unexpected cost of managing multiple services.</p><p>This situation raises critical questions: Do we really require microservices? Or has our excessive division led to the creation of a distributed monolith?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bO8HqBFHtZG91nncVGhvjg.png" /></figure><p>Microservices excel when applied to truly independent business domains with distinct scaling needs, separate release cycles, and autonomous teams. However, our use case was different — we encountered shared business logic, interdependent features, and similar scaling patterns. This, unfortunately, resulted in us creating a distributed monolith, with an ever-increasing complexity due to multiple codebases, databases, and deployment pipelines. 
The organizational impact was significant, as teams spent more time coordinating changes than delivering value, while struggling with the cognitive load of managing multiple services.</p><p>While microservices may appear well-suited for diverse scenarios, they are not a one-size-fits-all solution. Sometimes, even the best-intentioned implementations can lead to unforeseen costs. Adopting microservices, or any solution for that matter, can address certain challenges but may also introduce new, complex ones.</p><h4>From microrepos to monorepo: a recovery attempt 🚑</h4><p>In an effort to address the issues we faced with our microservices setup, we attempted to share common functionalities across services by reverting to a monorepo structure while keeping the services separate. The monorepo aimed to centralize code management and reduce redundancy, offering easier collaboration and alignment across teams. However, this quickly resulted in a monorepo cluttered with duplicate code, making it challenging to navigate and understand. Extracting common functions was both difficult and error-prone. Although the functionalities were similar, slight variations in their implementation complicated the process, making it anything but straightforward. Consequently, the cost and effort required to manage and maintain this setup remained high.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*D1mqmh0N9mOHHcjamitumw.png" /></figure><p>While monorepos can facilitate code sharing and cohesiveness, they require strong governance and standardization across teams to be effective. Merely consolidating code without addressing fundamental architectural challenges proved insufficient.</p><h4>From distributed monolith to modular monolith: a better future ☀️</h4><p>Faced with the challenges of a distributed monolith, what options do we have? Should we simply revert to the traditional monolith we had, with all its known limitations? 
Not necessarily.</p><p>Instead, by carefully considering both business needs and technical contexts, and following Domain-Driven Design (DDD) principles, we can identify a more balanced solution. DDD emphasizes structuring software around core business domains, recognizing that the ideal size of a service often aligns with a broader business domain or entity rather than focusing on granular features. For instance, instead of concentrating on individual “payment method options”, we consider “payments” as a cohesive whole. This shift in perspective allows us to consolidate common functions and organize the system into well-defined modules that focus on shared features, resulting in a more appropriately structured architecture.</p><p>In practice, our modular approach goes beyond simply separating the available payment options (“pay now”, “pay later”, “pay over time”, etc.) into distinct modules. Instead, we break down the application based on features. This results in clearly defined modules addressing specific functions — such as authentication, funding sources management, and risk assessment, among others. Each module represents a specific business domain with clear boundaries and responsibilities, enabling better organization and maintenance of the system.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RynJh-yIkTFDMnRd0ai-uQ.png" /></figure><p>By dividing the large system into smaller, manageable modules within the same application and limiting interactions across their boundaries, we achieved the desired balance: one service, one database, and a structured, clean codebase. Each module operates within a well-defined context, enabling isolated development and testing while eliminating the overhead of remote calls and the duplicated functionality that is common in microservices. 
In other words: thoughtful modularization combines the simplicity of a monolithic architecture with the flexibility of loosely coupled components.</p><p>The benefits extend beyond code organization. The modular architecture enhances observability through standardized logs, metrics, and dashboards, eliminating the complexity of managing numerous service-specific monitoring solutions. This unified strategy simplifies troubleshooting while ensuring consistency across the entire system. Moreover, these well-defined modules become natural candidates for future microservices if needed, making the modular monolith an excellent foundation for incremental architecture evolution.</p><p>Although this approach has clear benefits, it also brings challenges. The modular monolith presents an increased risk of a single point of failure; if the application crashes, it can impact the entire system. Additionally, managing a unified codebase across multiple teams in a single monorepo requires clear standards and practices for coding, testing, and deployment. Coordinating development efforts, with numerous commits and simultaneous changes, requires robust communication and aligned release cycles.</p><h4>The power of modularity: striking the right balance ⚖️</h4><p>While we consolidated the payment services into a modular monolith, this single service itself remains a key microservice in the purchase flow at Klarna. It continues to both integrate with and be integrated by many other services.</p><p>A significant challenge remains in creating a universal payment flow that accommodates different payment options across various markets, each with distinct functionalities and market-specific requirements. The solution here lies in customization through configurability. By developing one configurable flow, we can create a versatile set of features and steps that function like puzzle pieces or modular building blocks. 
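</p><p><em>As a rough sketch of this idea (hypothetical names, written in Python for illustration only — not Klarna’s actual code), such a configuration-driven flow could look like:</em></p>

```python
# Hypothetical sketch: assembling a payment flow from configurable building blocks.
# Step names and logic are illustrative, not Klarna's actual implementation.

STEP_REGISTRY = {}

def step(name):
    """Register a flow step under a name that configurations can refer to."""
    def decorator(fn):
        STEP_REGISTRY[name] = fn
        return fn
    return decorator

@step("authenticate")
def authenticate(ctx):
    ctx["authenticated"] = True
    return ctx

@step("collect_funding_source")
def collect_funding_source(ctx):
    ctx["funding_source"] = "card"
    return ctx

@step("assess_risk")
def assess_risk(ctx):
    ctx["risk_ok"] = ctx["amount"] < 10_000
    return ctx

def run_flow(config, ctx):
    """Run the steps listed in the configuration, in order."""
    for name in config["steps"]:
        ctx = STEP_REGISTRY[name](ctx)
    return ctx

# A payment option in a given market becomes "just" a configuration entry:
pay_later_se = {"steps": ["authenticate", "collect_funding_source", "assess_risk"]}
result = run_flow(pay_later_se, {"amount": 500})
```

<p><em>Under this sketch, launching a payment option in a new market amounts to registering any missing steps and adding a configuration entry, rather than building and deploying a new service.</em></p><p>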
These components can then be assembled and tailored through various configurations to efficiently support diverse payment options.</p><p>The migration to the single service is still ongoing, yet the new solution is already proving its effectiveness. By leveraging configurability, we were able to add new features to existing payment options and launch new payment options in various markets within days with minimal development resources — simply by adjusting configurations and running tests, many of which passed on the first attempt. Previously, such deployments would take weeks or even months, requiring coordination across multiple teams and extensive testing to ensure a unified solution. This streamlined approach has significantly reduced both time to market and resource requirements.</p><p>In terms of load, the service already handles over a million orders daily, demonstrating its robustness and scalability. This capability was put to the ultimate test during Black Friday, Klarna’s most challenging day for peak load and purchase volume. The system not only withstood the stress test but set a new record by processing one-third of all Klarna transactions that took place that day. Given this proven performance and significantly lower resource usage, the financial advantages become clear — why incur the overhead costs of microservices when a dynamic modular monolith can handle such scale efficiently?</p><p>We look forward to completing the full migration, which will enable streamlined, unified processing for all Klarna payments while maintaining optimal resource utilization.</p><h4>Key takeaways: optimizing architecture choices 🎯</h4><p>Understanding the nuances between different architectural styles is essential for making informed decisions. 
The following table offers a side-by-side comparison of microservices and modular monolith approaches, highlighting their strengths, challenges, and typical use cases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/853/1*FSw9UQDk0FsP7qZxINeyjQ.png" /></figure><p>In software development, cyclical trends are inevitable. As projects and organizations evolve, growth may necessitate splitting services again. By establishing a robust framework and embracing meaningful modularization — guided by Domain-Driven Design (DDD) principles and its focus on core business domains — teams can adapt their architecture to meet changing requirements without compromising system coherence or maintainability. The experience gained from modularization has revealed a key advantage: modules can be seamlessly extracted into microservices when needed. This flexibility enables teams to work within well-defined bounded contexts while maintaining the agility to restructure the system as necessary. Ultimately, this approach provides a sustainable path for continuous growth and innovation, allowing teams to navigate software development cycles with greater confidence and efficiency.</p><p><strong>Acknowledgement</strong></p><p><em>Many thanks to all the contributors for their support on this project, especially Alessandro Dal Bello, Batıkan Türkmen, Carlo Micieli, Francesco Maria Chiarenza, Mikael Vessgård, Niklas Peil, and Sanchi Goyal.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9ae7507a625c" width="1" height="1" alt=""><hr><p><a href="https://engineering.klarna.com/how-micro-should-your-microservices-be-9ae7507a625c">How micro should your microservices be?</a> was originally published in <a href="https://engineering.klarna.com">Klarna Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The fellowship of the forgotten]]></title>
            <link>https://engineering.klarna.com/the-fellowship-of-the-forgotten-d341045a6123?source=rss----86090d14ab52---4</link>
            <guid isPermaLink="false">https://medium.com/p/d341045a6123</guid>
            <category><![CDATA[erlang]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[klarna]]></category>
            <category><![CDATA[postgresql]]></category>
            <dc:creator><![CDATA[Onno Vos Dev]]></dc:creator>
            <pubDate>Wed, 26 Feb 2025 09:04:27 GMT</pubDate>
            <atom:updated>2025-03-12T11:20:52.248Z</atom:updated>
            <content:encoded><![CDATA[<h3>How we migrated from Mnesia to Postgres with zero downtime</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*03pOkSaJMc28EH1uzbH08g.jpeg" /></figure><p>Back in December 2004, an Erlang application called KRED was born (referring to the freshly-started company called Kreditor, now known as Klarna). KRED is one of the “servicing systems” at Klarna and keeps track of consumer debt (among other things). It was powered by <a href="https://www.erlang.org/doc/apps/mnesia/mnesia.html">Mnesia</a> and consisted of a cluster of 7 nodes, each holding a full copy of the database on disk. The data was replicated using a custom replication mechanism built in-house by Klarna. One node was elected as the leader and its database was considered the source of truth in the system. All database transactions were executed on the leader and writes were replicated to the rest of the nodes, the so-called followers.</p><p>The Mnesia database was around 15 TB and at its peak in 2018 around 1.3 TB was held in memory at all times. Considering that few suppliers were selling hardware with such specs, it’s easy to claim the crown of one of the biggest Mnesia databases running in production, in terms of in-memory storage. The rest of the data was offloaded to disk using <a href="https://github.com/klarna/mnesia_eleveldb">mnesia_eleveldb</a>.</p><p>KRED has been a stable workhorse at Klarna, so why change a winning concept?</p><p>Get ready for a two-part blog post where we’ll first go through the journey of how we went about this and, secondly, how we made Postgres behave just like Mnesia by implementing our own version of the serializable isolation level on top of Postgres!</p><h4><strong>How the journey started</strong></h4><p>Three engineers sat down in a bar in Stockholm, Sweden, and asked this question:</p><blockquote>‘When Klarna truly takes off, will KRED survive? 
Assuming “no”, and presented with a blank check, how would we tackle this problem?’</blockquote><p>The answer quickly revolved around the issues of running Mnesia on an even larger cluster, and of leveldb compaction hitting some hot tables during peak times. One can only imagine how that problem would just continue to get worse over time.</p><p>Considering the three engineers had worked on KRED for a long time, scaling KRED was an intriguing thought, and the evening was concluded with an agreement to explore this plan further in depth. While some afterwork talks never lead to anything productive, this one did!</p><p>The trio had several sessions together and drafted a plan on how to solve KRED’s scaling issues by sharding the system across multiple clusters. An ambitious plan and certainly a challenging one. But hey, shoot for the stars, aim for the moon, right?</p><p>The deal was simple: give us a blank check, a few engineers and a couple of years of runway, and we will execute the plan with the newly formed team. Well, I wouldn’t be writing this article if that meeting didn’t go well. The target was:</p><blockquote>“Scale KRED in whatever way you see fit, it may be sharding, it may be something else but scale it, that’s the goal!”</blockquote><p>A team of six engineers, a manager and a product owner were assigned to the task and it was time to start hacking!</p><h4><strong>Investigation phase</strong></h4><p><strong>Sharding KRED<br></strong>Since our goal was to “Scale KRED in whatever way you see fit”, our first task would be to investigate exactly “what way” would be “the right way”. Seeing as the team already had a plan for sharding KRED on paper, we started investigating whether or not this plan was actually feasible.</p><p>The idea was simple: each KRED cluster would elect its own leader, use Mnesia and leveldb, run the exact same KRED codebase and follow the same release process. 
Each cluster would be of a limited database size to avoid hitting limits such as the number of database locks obtained per second, the number of transactions committed per second, the number of bytes written per transaction, and the amount of disk space and RAM required by each instance. Each of these was a limit that KRED would eventually hit one way or another.</p><p>After a thorough investigation the idea was put on ice. The reason was simple: there were too many complexities and missing pieces of the puzzle to make this happen. Sharding KRED required a lot of surrounding systems to be adjusted to operate against more than one KRED cluster, and none of them were ready for such a change at the time. Furthermore, several critical pieces that were required weren’t even built, so it was decided that Klarna was not (yet!) ready for a sharded KRED.</p><p><strong>Moving the database out of the application nodes<br></strong>One rainy day in November 2019, the team decided to do some mob programming just to see if it was possible to run KRED on a SQL database rather than Mnesia. The rationale was simple: using a third-party SQL database would avoid a lot of the issues we’d seen with Mnesia, so perhaps migrating to a SQL database would be the answer?</p><p>We’d be getting:</p><ul><li>Fully managed — ergo a lot less maintenance overhead;</li><li>Cost effective — using a single database rather than seven full copies would be a lot more cost effective;</li><li>Reliable — despite not having had too many difficulties with the reliability of Mnesia, we were one of the biggest users of Mnesia and mnesia_eleveldb. 
Hence, we did not have the luxury of having the community find bugs and issues before they hit us, as we would when moving to Postgres.</li><li>Ability to scale the application nodes independently from the database;</li><li>Ability to leverage modern database features such as efficient secondary indexes, a rich query language and schema evolution;</li><li>Standardized tooling around operational aspects such as backups and deployment.</li></ul><h4>Simple plans are never simple</h4><p>In essence, the plan was simple: introduce a follower node running Postgres as the database backend and slowly push it towards the leader role one step at a time. Once the leader is backed by Postgres, the Mnesia nodes simply become followers and we can replace them with KRED nodes that no longer hold a database on disk, but talk directly to a Postgres database.</p><p>Just like our original sharding plan, this plan sounded simple, but in practice there were a lot of moving parts involved and plenty of unknowns to be solved. The key to success in this project was to break each step towards the end goal into smaller epics, each of which should bring value to KRED regardless of whether or not the end state would even be achieved. Ergo, even if we wouldn’t be able to migrate KRED to Postgres in the end, we should end up with a better KRED either way. At any point in time, we should be able to pull the brakes and still have a net positive effect on KRED. Considering the large number of unknowns in a project of this size and risk, this has been key throughout the past years: it let us stay focused on smaller deliverables, each contributing to the end state, rather than feel overwhelmed by the end state itself. Only at the very end of the project should the deliverables be allowed to grow in size and become more and more focused on the actual migration. 
And only in the very last step should we face a point-of-no-return type of moment.</p><h4>Our master plan</h4><p><strong>The compatibility layer known as kdb<br></strong>KRED business logic has not been allowed to interact with Mnesia directly for many years. Instead, all calls went through a wrapper we controlled, called kdb. The original purpose of kdb was to implement our in-house data replication mechanism, but it doubled perfectly as a compatibility layer during this migration. We decided to keep the external API of kdb largely intact but behind the scenes allowed data to be stored in either Postgres or Mnesia.</p><p>With this approach we could minimize the changes in the business layers. A few minor changes had to be made, however, such as introducing multi-read and multi-write functionality. This was trivial to implement on Mnesia, as you’d simply iterate over the 10 keys or records and call out to Mnesia one by one. This was still fast as the data is local to the OS process. On Postgres, however, reading 10 records from a single select was much faster than performing 10 selects one by one.</p><p>The key here was to make sure that both Mnesia and Postgres behave the same way, and hence lots of tests, including <a href="https://propertesting.com/">property-based</a> tests, were written to ensure that both performed equally.</p><p><strong>Introduction of types<br></strong>Mnesia stores untyped records (tuples) whereas Postgres requires types to build its columns. At the very beginning of this journey we simply created a key-value schema in Postgres where both the key and value were a <a href="https://www.erlang.org/doc/apps/erts/erl_ext_dist.html">term_to_binary/1</a> representation: the key being the actual key and the value being the entire record. 
Naturally this was limiting to us in many ways, but one of the main limitations was that <a href="https://www.erlang.org/doc/apps/erts/match_spec.html">match specifications</a> could not be executed, and in order to do any kind of filtering, the entire table contents had to be brought back into memory for Erlang-side filtering. Hence, the journey began to add (and enforce!) types to all of our records.</p><p>This was achieved by iterating over the full 15+ TB (at the time) database and, for each field in each record, determining its type. Each time a new type was discovered for a particular field, we combined it with the previously determined type in order to find a type that would fit both. Let’s look at a few examples to make this clear.</p><p>Given the following three records:</p><pre>#purchase{id = 1234, amount = 100, description = &quot;shoes&quot;}<br>#purchase{id = 123456, amount = 200, description = &quot;&quot;}<br>#purchase{id = 12345678987654321234, amount = 300, description = undefined}</pre><p>Iterating over all three records, we can see that the id of the first record can be placed in an int2 as it is within the range of <em>-32768</em> to <em>+32767</em>. The id of the second record fits in the int4 range (<em>-2147483648</em> to <em>+2147483647</em>), so at this point we would upgrade its type to int4 since both ids would fit in this type. The third record, however, would only fit in a numeric datatype, so we had to upgrade the <em>#purchase.id</em> field to a numeric datatype in order to fit all possible ids.</p><p>The amounts were more straightforward as all of them would fit in an int2 range. However, considering that amounts are somewhat arbitrary, we decided that such fields should be upgraded to a higher int range, for example an int8, so that really large purchases would not crash the database. 
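</p><p><em>As a simplified illustration of this widening pass (Python with made-up names; the real implementation worked on Erlang records and is not shown here), the type combination might look like:</em></p>

```python
# Simplified sketch of the type-widening pass described above (illustrative only).
# Integer fields start at the narrowest SQL int type and widen until every
# observed value fits; anything beyond int8 falls back to numeric.

INT_TYPES = [
    ("int2", -32768, 32767),
    ("int4", -2147483648, 2147483647),
    ("int8", -9223372036854775808, 9223372036854775807),
]
WIDENING_ORDER = ["int2", "int4", "int8", "numeric"]

def type_of(value):
    """Narrowest SQL integer type that fits a single observed value."""
    for name, lo, hi in INT_TYPES:
        if lo <= value <= hi:
            return name
    return "numeric"  # too big for any fixed-size int

def widen(current, new):
    """Combine an already-inferred type with a newly observed one."""
    if current is None:
        return new
    return WIDENING_ORDER[max(WIDENING_ORDER.index(current),
                              WIDENING_ORDER.index(new))]

# Ids from the three example #purchase records:
inferred = None
for purchase_id in [1234, 123456, 12345678987654321234]:
    inferred = widen(inferred, type_of(purchase_id))
# inferred is now "numeric", wide enough for all observed ids
```

<p><em>Nullability tracking (the undefined handling discussed below) is omitted from this sketch for brevity.</em></p><p>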
Since amounts always have to be present, this would end up being a non-nullable int8.</p><p>The description is more interesting again, as here we see three different values: a string, an empty string and the atom undefined. In this case we can treat undefined as a null value, and hence this would result in a nullable string.</p><p>Erlang, however, does not have a null concept, and <a href="https://github.com/epgsql/epgsql">epgsql</a> treats undefined as null. KRED was inconsistent in treating undefined as if it were a null value: in KRED, <em>null</em>, <em>undefined</em> or <em>[]</em> were all used to indicate a null value in records. Eventually, we ended up having to treat each of these as nulls and keeping track, for each nullable field, of what its null value should be. On the way to Postgres, we could transform the field’s null value to undefined in order to store a null, and on the way back from Postgres transform it back to the null value that the application expects.</p><p><strong>Replication from Mnesia to Postgres, and back again<br></strong>One of the key aspects of this migration was the ability to switch back and forth between Postgres and Mnesia as the leader. This meant that replication between Postgres ↔ Mnesia had to work in both directions. We wanted the ability to switch back to Mnesia if we saw any unforeseen issues with Postgres as the leader. Sounds easy enough on paper, but it turned out to be quite the headache in practice.</p><p>The first step was to implement a replication mechanism from Mnesia to Postgres. Considering we already had a transaction log published out of KRED, which Mnesia followers used to update their database, this initially seemed fairly straightforward. As with everything in this project, it turned out not to be. 
One of the problems we encountered was that the added latency of Postgres caused the replication to lag, unable to keep up with the sheer volume of transactions coming out of KRED.</p><p>We ended up building a pipelined replication mechanism, in which a worker process was spun up that was responsible for building up and flushing a batch of transactions. A worker batched incoming transactions and, once a configurable batch size was reached, started flushing the batch to the database.</p><p>Two distinct modes were available for the replication mechanism: high throughput mode, and low latency mode.</p><p>In high throughput mode, we would give up consistency of the database in favor of import speed. The node would not accept any traffic, and all it had to do was import transactions as fast as it could in order to catch up again.</p><p>Once caught up, the node would automatically switch back to low latency mode and traffic would be allowed on this node again. In low latency mode, two strategies existed.</p><p>One strategy was to split big transactions into several smaller ones and handle them in parallel. This was possible since the Mnesia replication did a similar thing, committing a single transaction on a per-table basis rather than as a single snapshot. With this strategy, all operations on a single table were committed on their own rather than as a single unit.</p><p>The second strategy was to batch several smaller transactions into a bigger one and handle them serially. This allowed us to use multi-delete and multi-write as well as get rid of obsolete operations. Ergo, if a record was written twice in quick succession, only the last write would be performed. Similarly, if a record was written but subsequently deleted, only the delete would be performed.</p><p>During low latency mode, workers maintained record-level locks on the Erlang side (local to the node).
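</p><p>The coalescing step of the second strategy can be sketched like this (TypeScript for illustration, hypothetical names; the real implementation was Erlang):</p>

```typescript
// Sketch of coalescing a batch of operations per (table, key):
// a later write overwrites an earlier one, and a delete after a
// write leaves only the delete, shrinking what hits Postgres.
type Op =
  | { kind: 'write'; table: string; key: string; value: string }
  | { kind: 'delete'; table: string; key: string };

function coalesce(ops: Op[]): Op[] {
  const latest = new Map<string, Op>();
  for (const op of ops) {
    // Last operation per record wins; earlier ones are obsolete.
    latest.set(`${op.table}/${op.key}`, op);
  }
  return [...latest.values()];
}

const batch: Op[] = [
  { kind: 'write', table: 't1', key: 'k1', value: 'v1' },
  { kind: 'write', table: 't1', key: 'k1', value: 'v2' }, // overwrites v1
  { kind: 'write', table: 't2', key: 'k2', value: 'v3' },
  { kind: 'delete', table: 't2', key: 'k2' },             // only the delete remains
];
const result = coalesce(batch); // two operations survive
```

<p>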
Only those records for which the worker could acquire a lock would be sent to the database. Records for which no lock could be acquired would be sent once a lock was released by one of the previous workers. Only once all locks were acquired and all database operations were performed would the transaction be committed.</p><p>With all this in place, we were able to serve a Postgres follower with about a 20ms lag, a totally acceptable amount of lag in our case.</p><p>Now that we had replication from the Mnesia leader to the Postgres follower in place, what would come next? We obviously also needed replication from Postgres to Mnesia for when it was time for Postgres to take over the leader role. As previously mentioned, it had to be possible to switch back and forth between an Mnesia leader and a Postgres leader. This had to be possible with no downtime, as switches happen regularly during maintenance or, on the rare occasion, due to server failure.</p><p>Two methods of achieving this were put forward as possible candidates. The first was logical replication: the Mnesia followers could use logical replication to obtain a stream of all database events that happened, essentially mimicking the replication mechanism that we had in place between Mnesia nodes.</p><p>The second was to use prepared transactions and transmit a transaction log across the cluster via rpc calls, just as before, after having prepared each transaction; once a transaction has been prepared, it is extremely unlikely to fail. Again, our Mnesia-based replication used a similar methodology, so whatever had worked for the past 15 years surely would work for a transition period of a few months at most.</p><p>Both of these approaches came with their pros and cons. Logical replication would result in us having to rewrite large parts of the replication logic, an area of the code that had been fairly stable for many years without requiring a lot of attention.
Changing this up 180 degrees would be very risky, though luckily it would have minimal impact on the Postgres leader node. Using prepared transactions would have little to no impact on the Mnesia followers, while the performance penalty on the Postgres leader would be considerably higher.</p><p>In the end we opted for prepared transactions, due to the minimal risk as well as the closeness to the current method used for Mnesia replication. Surely you can see a trend here, can’t you?</p><p><strong>Validating the replication<br></strong>Ok, so we had a replication mechanism in place between Mnesia and Postgres. But how do we ensure that this replication works as intended? Naturally, we had plenty of tests to cover this, including a bunch of tests that ran an entire cluster of KRED nodes to ensure that everything worked as expected in a clustered fashion. But we wanted more than that.</p><p>In the early days, we regularly compared our Mnesia backups with Postgres snapshots (each DB, row by row), similar to <a href="https://github.com/udacity/pgverify">pgverify</a>, which can compare CockroachDB with Postgres. Considering the size of the database, this operation took about 7 days before we’d get back a yay or nay on whether or not anything had gone awry.</p><p>What if we could validate the replication by keeping track of all the keys that were written, updated or deleted and ensuring that the state on the leader also exists on the followers?
Well, considering that all our database operations were going through the kdb layer, it was trivial to record the keys in a log and have a separate process rpc across the cluster to verify that the state converged and ended up the same everywhere.</p><p>So given a transaction that performs the following operations:</p><pre>write(table1, key1, value1)<br>write(table2, key2, value2)<br>delete(table1, key3)<br>update(table2, key4, newvalue4)</pre><p>We logged the following structure to a <a href="https://www.erlang.org/doc/apps/kernel/disk_log.html">disk_log</a>:</p><pre>[ {key1, table1, write, Timestamp}<br>, {key2, table2, write, Timestamp}<br>, {key3, table1, delete, Timestamp}<br>, {key4, table2, write, Timestamp}<br>]</pre><p>We would then read the log in real time from another process and use <em>rpc:multicall</em> to check with all nodes across the cluster that the two writes as well as the update had propagated correctly, with the same value existing in every database, and that the delete had propagated, so that the record was no longer present anywhere.</p><p>As it turned out, this process was extremely good at finding false positives. Jokes aside, after some tweaks we had this running for almost a year before our final switch, and it gave us the confidence that all databases (Mnesia replicas) as well as Postgres were always up to date and equal to each other. That feeling of confidence was well worth the few false alarms we got from it.</p><h4>The final switch</h4><p>After having run Postgres as the leader for some months, the day had finally come to switch off the Mnesia nodes and fully run on Postgres. All the runbooks for any foreseeable disaster were in place and we were all ready. We gathered a bunch of KRED teams into a Google Hangout and started the shutdown of the Mnesia nodes.
Can you spot the timestamp at which all Mnesia nodes were shut off in the graph below?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/598/0*0cxXDxOOsS43r5Hq" /></figure><p>No? Well, neither could we. The final shutdown of all Mnesia nodes happened at 10:00:41 to be precise. No change in traffic was observed and everything kept on sailing smoothly. Honestly, after 3 years of work this was probably among the most anticlimactic changes we’ve ever made.</p><p>In this context, anticlimactic was a good thing! We had officially migrated from Mnesia to Postgres with zero downtime! Time to party? Certainly! And what a party that was! But… we had one itsy-bitsy bit of cleanup to do, namely we really wanted to get rid of prepared transactions.</p><h4>The last migration</h4><p>There are two methods for exporting data out of KRED. First, there is a specific node that exports specific tables to a Postgres database for business intelligence purposes. This node was destined to remain on Mnesia due to the massive rewrite it would take to move it over to Postgres too.</p><p>The second method is exporting domain events in the form of Avro messages published to Kafka. Both of these methods of exporting data use the transaction log, each in their own way. For event emission we used a specific data structure that was inserted into the transaction log, used to atomically append events for a specific transaction to the transaction log.</p><p>Remember how we leveraged two-phase commits in order to publish a transaction log across the cluster? Well, considering the performance penalty of this, we wanted to move away from it as soon as possible and use plain logical replication instead.</p><p>A slightly unknown feature introduced in Postgres 9.6 (<a href="https://github.com/postgres/postgres/commit/3fe3511d">3fe3511d</a>) allows passing arbitrary messages in either text or bytea format to logical decoding plugins via the WAL.
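</p><p>Downstream consumers of the pgoutput plugin see such a payload as a logical replication “Message” (type byte <em>M</em>). As a sketch (TypeScript for illustration; the actual decoder was Erlang, and this assumes the non-streamed field layout described in the Postgres logical replication protocol docs), decoding one could look like:</p>

```typescript
// Sketch of decoding a pgoutput logical decoding "Message" ('M'),
// assuming the non-streamed layout: Byte1('M'), Int8 flags,
// Int64 LSN, null-terminated prefix, Int32 length, content bytes.
interface DecodedMessage {
  transactional: boolean;
  lsn: bigint;
  prefix: string;
  content: Buffer;
}

function decodeMessage(buf: Buffer): DecodedMessage {
  if (buf.readUInt8(0) !== 0x4d /* 'M' */) throw new Error('not a Message');
  const flags = buf.readUInt8(1);
  const lsn = buf.readBigUInt64BE(2);
  const end = buf.indexOf(0, 10);            // prefix is null-terminated
  const prefix = buf.toString('utf8', 10, end);
  const len = buf.readUInt32BE(end + 1);
  const content = buf.subarray(end + 5, end + 5 + len);
  return { transactional: (flags & 1) === 1, lsn, prefix, content };
}
```

<p>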
In essence, this would allow us to inject the transaction log entry into the WAL and migrate the above two use cases to simply pick that message from a logical decoding stream, allowing us to stop using prepared transactions. Testing showed that we could shave off roughly 20ms from each database transaction by moving away from prepared transactions, so this was a price well worth paying. Unfortunately, the pgoutput plugin only added support for this in Postgres 14 (<a href="https://github.com/postgres/postgres/commit/ac4645c0157fc5fcef0af8ff571512aa284a2cec">ac4645c</a>), so an upgrade of our Aurora Postgres from 13 to 14 was required.</p><p>Another change was required in the <a href="https://github.com/SIfoxDevTeam/epgl/issues/3">Erlang pgoutput decoder plugin</a> in order to support this new message format. Once in place, we could insert arbitrary transaction logs into the Postgres WAL, which could replace the transaction log that was replicated over Erlang RPC and subsequently remove our need for prepared transactions.</p><h4>Final notes</h4><p>A refactor of this size is certainly not a trivial one. There’s plenty more that cannot be covered in this blog or else I’d be taking way too much of your time. There are a few key takeaways though that certainly hold true and can help you tackle a project of this size.</p><ol><li><strong>Create a proof of concept </strong>— It may sound trivial, but getting early feedback on whether or not something even remotely appears possible is key here. It doesn’t have to be pretty, it just has to compile (if applicable) and “kinda run”. Whatever code you write for this PoC, consider it throwaway.</li><li><strong>Break it down into bite-sized chunks</strong> — Sure, you’re not gonna know all the bites you’re gonna have to bite through, but it’ll help put things into perspective. Ensure that each bite-sized chunk is a positive change to the codebase.
The further you go, the more of a “point of no return” may be acceptable, but I’d argue that at least 75% of our changes would have proven useful regardless of whether or not we switched to Postgres in the end.</li><li><strong>Solid testing strategy </strong>— I’ll dive more into this in part 2 of this blog post, but every single change we made was tested and could be proven correct. We had plenty of metrics and tests to ensure our business code was kept intact and performant. We also had real-time monitoring to ensure that nothing broke along the way.</li><li><strong>Do not stop </strong>— Sh*t happens and it’s ok. Hardly any problems are unsolvable. With the right attitude, patience and knowledge, projects such as these can be turned into a great success story!</li></ol><p>In <a href="https://engineering.klarna.com/guardians-of-consistency-caf10252313e">part 2</a>, we will focus on database isolation levels, how we managed to make Postgres behave like Mnesia, and how we ensured our implementation of serializable on top of Postgres met the requirements for the serializable isolation level.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d341045a6123" width="1" height="1" alt=""><hr><p><a href="https://engineering.klarna.com/the-fellowship-of-the-forgotten-d341045a6123">The fellowship of the forgotten</a> was originally published in <a href="https://engineering.klarna.com">Klarna Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Automating the Klarna Card Ownership Fees System using AWS Step Functions]]></title>
            <link>https://engineering.klarna.com/automating-the-klarna-card-ownership-fees-system-using-aws-step-functions-346ce7094278?source=rss----86090d14ab52---4</link>
            <guid isPermaLink="false">https://medium.com/p/346ce7094278</guid>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[klarna]]></category>
            <category><![CDATA[serverless]]></category>
            <category><![CDATA[step-functions]]></category>
            <category><![CDATA[automation]]></category>
            <dc:creator><![CDATA[Michel Neumann]]></dc:creator>
            <pubDate>Thu, 02 May 2024 07:42:27 GMT</pubDate>
            <atom:updated>2024-05-02T07:42:27.827Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>This article outlines how my team and I applied automation using AWS Step Functions and CloudFormation to a system that charges the monthly fee for Klarna Cards, enabling us to transform a previously manual routine into a self-sufficient, scheduled workflow. The initiative significantly streamlined operations and reduced maintenance costs.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nr6R-4LVUt5jK0yVl49t8g.jpeg" /></figure><h4><strong>Introduction</strong></h4><p>In early 2023, Klarna introduced monthly fees for Klarna Cards in the US. In the Card &amp; Banking domain, two teams, myself included as an engineer, developed this system within a tight four-month deadline. Initially, the system required extensive manual operation, including a detailed checklist for engineers to follow to ensure successful execution. The teams launched, planning iterative improvements to that routine.</p><p>Months passed without any advancement in refining the operation process or automating any part of it. To give an overview, this is what the teams needed to do to run the batch jobs:</p><ul><li>Designating an engineer to lead the monthly process, coordinated using JIRA tickets</li><li>Updating exemption lists and submitting pull requests to the codebase prior to initiating batch runs</li><li>Ensuring data integrity by performing Athena queries across three different production databases within the AWS Console</li><li>Manually initiating multiple batch jobs in a specified sequence with manual input of arguments in a live production setting</li><li>Awaiting termination of the jobs and conducting a thorough review of the outcomes of the final batch jobs for each market</li></ul><p>Overall, this routine took around three business days per month and required two engineers to approve code changes and review the results of the batch runs.
Considering new markets where fees may be rolled out, this workflow posed a significant challenge to maintaining high-quality standards and preventing potential incidents.</p><h4><strong>Taking on The Challenge</strong></h4><p>We recognized that continuing with our current process was unsustainable and bound to cause issues down the line. When the topic got priority, I took the opportunity to lead the initiative, dedicating my time to investigating the problem and proposing a viable solution.</p><p>I created a “Request for Comments” (RFC) document, a formal method employed by Klarna for proposing ideas, to outline potential solutions and gather feedback. During this process, discussions and constructive feedback were shared. We concluded by committing to AWS Step Functions, a service that allows orchestrating multiple services into serverless workflows.</p><p>To manage expectations and set project milestones, I developed a detailed timeline. I estimated the completion of different phases of the automation project, including the initial MVP and the final implementation. Additionally, I outlined a series of implementation and discovery tasks to be integrated into upcoming sprints.</p><h4><strong>System Overview</strong></h4><p>For better context, I am going to describe the Ownership Fee system as it operated before the introduction of automation. The system is distributed over two AWS accounts: one directly owned by the team and another shared account with resources used by the Klarna App.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Y6IuSbmIfH_56oky" /><figcaption>Architecture diagram of the Ownership Fees before automation</figcaption></figure><p>CloudFormation and AWS CDK are utilized to deploy resources. Most notable are two Glue jobs and two SQS queues, one of them being a “Dead Letter Queue” (DLQ), containing messages that could not be processed.
The Glue jobs read data from three distinct databases, each managed by a separate domain and housed in yet another AWS account.</p><p>A Python script running PySpark preprocesses, joins, and transforms the data to generate JSON records, each describing a customer’s card information for a given month. These records are written into the SQS queue and processed by an AWS Lambda function, which is deployed in the shared AWS account. The Lambda function implements a decision tree which results in either triggering a fee charge via a dependency system or exempting the customer from the fees for the current month. Decisions made by the Lambda function are stored in a DynamoDB table within the same account.</p><h4><strong>Adding Automation</strong></h4><p>To implement automation, we created a so-called “state machine”, which describes a sequence of event-driven steps where each step in the workflow is called a “state”. A state represents a unit of work that can call any AWS service or API. In our case these states consist of Lambda functions and Glue job triggers that solve the issues of the manual routine mentioned previously. Finally, we needed to add a scheduler to trigger the workflow on a predetermined interval, to avoid having an engineer invoke the system manually by logging into the AWS Console and launching the state machine.</p><p>Leveraging the existing use of AWS CloudFormation, we created a new Stack using CDK, exclusively containing the automation resources. A Stack describes a modular grouping of AWS resources that can be independently managed and altered without impacting other stacks.
In this stack, the state machine has been defined.</p><pre>import { Stack, StackProps } from &#39;aws-cdk-lib&#39;<br>import { Construct } from &#39;constructs&#39;<br>import { StateMachine } from &#39;aws-cdk-lib/aws-stepfunctions&#39;<br><br>export class KlarnaCardOwnershipFeesAutomationStack extends Stack {<br>  constructor(scope: Construct, id: string, props: StackProps) {<br>    super(scope, id, props)<br>    this.provision()<br>  }<br><br>  private provision() {<br>    // stateMachineExecutionRole and definitionBody are defined further below<br>    const stateMachine = new StateMachine(this, &#39;state-machine&#39;, {<br>      stateMachineName: `klarna-card-fees-automation`,<br>      role: stateMachineExecutionRole,<br>      definitionBody,<br>    })<br><br>    // More resources will follow here...<br>  }<br>}</pre><p>To integrate the execution of Lambda functions into the workflow, “LambdaInvoke” tasks needed to be created.</p><p>Because the Lambda is defined in the same CDK stack, we were able to reference it directly. However, it is also possible to invoke Lambda functions that are deployed elsewhere, for example by using their resource ARNs.</p><pre>import { Code, Function, Runtime } from &#39;aws-cdk-lib/aws-lambda&#39;<br>import { JsonPath } from &#39;aws-cdk-lib/aws-stepfunctions&#39;<br>import { LambdaInvoke } from &#39;aws-cdk-lib/aws-stepfunctions-tasks&#39;<br><br>const lambda = new Function(this, &#39;lambda&#39;, {<br>  functionName: `klarna-card-fees-lambda`,<br>  code: Code.fromAsset(&#39;...&#39;),<br>  handler: &#39;index.handler&#39;,<br>  runtime: Runtime.NODEJS_20_X,<br>})<br><br>const lambdaInvoke = new LambdaInvoke(this, &#39;step-invoke-lambda&#39;, {<br>  lambdaFunction: lambda,<br>  resultSelector: {<br>    FeePeriod: JsonPath.stringAt(&#39;$.Payload&#39;),<br>  }<br>})</pre><p>To pass arguments between states, tasks support specifying input and output selectors. This is a feature of the “Amazon States Language”, a JSON-based, structured language used to define state machines.
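</p><p>Rendered into Amazon States Language, the Lambda invocation task comes out roughly as follows (a sketch; the exact state names and fields generated by CDK will differ):</p>

```json
{
  "step-invoke-lambda": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
      "FunctionName": "klarna-card-fees-lambda"
    },
    "ResultSelector": {
      "FeePeriod.$": "$.Payload"
    },
    "Next": "step-fn-start-glue-job"
  }
}
```

<p>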
The return value of the Lambda function (“Payload”) is passed to the next state as a JSON record of the following format:</p><pre>{<br>  &quot;FeePeriod&quot;: &quot;2024-01&quot;<br>}</pre><p>Up next, the steps to trigger the Glue jobs were defined. The tasks reference the Glue jobs by their name. It is important to note that both the state machine and the Glue job must be deployed in the same region, as there is no support for invoking them across regions or accounts as of the time of writing.</p><pre>import { IntegrationPattern, JsonPath, TaskInput } from &#39;aws-cdk-lib/aws-stepfunctions&#39;<br>import { GlueStartJobRun } from &#39;aws-cdk-lib/aws-stepfunctions-tasks&#39;<br><br>const startGlueJob = new GlueStartJobRun(this, &#39;step-fn-start-glue-job&#39;, {<br>  glueJobName: &#39;klarna-card-fees-glue-job&#39;,<br>  integrationPattern: IntegrationPattern.RUN_JOB,<br>  arguments: TaskInput.fromObject({<br>    &#39;--FEE_PERIOD&#39;: JsonPath.stringAt(&#39;$.FeePeriod&#39;),<br>    &#39;--MARKET&#39;: &#39;US&#39;,<br>  }),<br>})</pre><p>Accessing the state input, which in our case was returned by the previous Lambda task, can be achieved through a syntax known as “JSONPath”.
With “$” pointing to the root of the input object, the “FeePeriod” property is selected, read as a string and passed as a Glue job run argument.</p><p>By setting the “integrationPattern” to “IntegrationPattern.RUN_JOB”, the execution of the Glue job is handled synchronously, meaning the state machine will pause until the Glue job has terminated.</p><p>Linking the previously defined steps together, concluding with a “Succeed” step to mark the execution as successful, forms the basis of the so-called “definition body” of the state machine.</p><pre>import { DefinitionBody, Succeed } from &#39;aws-cdk-lib/aws-stepfunctions&#39;<br><br>const definitionBody = DefinitionBody.fromChainable(<br>  lambdaInvoke.next(<br>    startGlueJob.next(new Succeed(this, &#39;automation-complete&#39;))<br>  )<br>)</pre><p>Due to the “least privilege” principle of the AWS Well-Architected Framework, the state machine itself initially lacks permissions to execute the Lambda functions and Glue jobs. In alignment with best practices and to ensure the state machine operates within its required capabilities, an AWS IAM role was created that grants the missing permissions.
This role was added to the state machine’s configuration.</p><pre>import { Effect, PolicyDocument, PolicyStatement, Role, ServicePrincipal } from &#39;aws-cdk-lib/aws-iam&#39;<br><br>const stateMachineExecutionRole = new Role(this, &#39;state-machine-execution-role&#39;, {<br>  assumedBy: new ServicePrincipal(`states.us-east-1.amazonaws.com`),<br>  roleName: `state-machine-execution-role`,<br>  inlinePolicies: {<br>    &#39;state-machine-execution-role-policy&#39;: new PolicyDocument({<br>      statements: [<br>        new PolicyStatement({<br>          effect: Effect.ALLOW,<br>          actions: [&#39;lambda:InvokeFunction&#39;],<br>          resources: [`arn:aws:lambda:*:${account}:function:klarna-card-fees-lambda`],<br>        }),<br>        new PolicyStatement({<br>          effect: Effect.ALLOW,<br>          actions: [&#39;glue:StartJobRun&#39;],<br>          resources: [`arn:aws:glue:*:${account}:job/klarna-card-fees-glue-job`],<br>        }),<br>      ],<br>    }),<br>  },<br>})</pre><p>With all components configured, deploying the stack is as simple as executing the “cdk deploy” command, after which the resources will be set up. Accessing the state machine is straightforward through the AWS Console under the “Step Functions” section.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ujVQWmkldxlbRUNR" /><figcaption>Step Functions module in the AWS Console</figcaption></figure><p>A scheduled trigger to enable full automation is still missing. For this, AWS EventBridge provides a solution to set up a CRON scheduler, which automatically triggers the state machine at specified intervals.
An additional IAM role is required for the scheduler to have the necessary permissions to activate the state machine.</p><pre>import { CfnSchedule } from &#39;aws-cdk-lib/aws-scheduler&#39;<br><br>const schedulerExecutionRole = new Role(this, &#39;scheduler-execution-role&#39;, {<br>  assumedBy: new ServicePrincipal(&#39;scheduler.amazonaws.com&#39;),<br>  roleName: `klarna-card-fees-scheduler-role`,<br>})<br><br>const schedule = new CfnSchedule(this, &#39;automation-schedule&#39;, {<br>  name: `klarna-card-fees-automation-schedule`,<br>  scheduleExpression: &#39;cron(0 9 3 * ? *)&#39;, // Every 3rd at 9:00AM UTC<br>  state: &#39;ENABLED&#39;,<br>  target: {<br>    arn: stateMachine.stateMachineArn,<br>    roleArn: schedulerExecutionRole.roleArn,<br>  }<br>})<br><br>stateMachine.grantStartExecution(schedulerExecutionRole)</pre><p>By configuring the CRON schedule, enabling it, and designating the state machine as the target, the setup is complete. Following a redeployment, automation is fully in place, allowing the system to operate independently. To get notified about issues during execution, we created custom monitors in Datadog with an integration to OpsGenie, paging an engineer if unexpected errors occur.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5xYTmpI3Z9gGtrE-" /><figcaption>A CRON scheduler has been created using AWS EventBridge, targeting the state machine</figcaption></figure><p>A complete architecture diagram showcasing the addition of the entire automation framework is provided below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*hYbvOtesND8vpAkl" /><figcaption>Complete architecture diagram including automation</figcaption></figure><h4><strong>Results</strong></h4><p>What were the advantages of solving this problem? Primarily, it led to a 12.5% monthly reduction in the team’s workload, freeing up resources for additional projects and initiatives.
This decrease in workload translates into lower operational costs, as there’s no longer a need to dedicate engineer hours to system operation. By automating processes, the risk of human error was minimized, thereby reducing the likelihood of potential incidents.</p><p>Additionally, the team had the chance to reduce tech debt, reworking IAM roles by eliminating unnecessary permissions and improving the staging environment, enabling thorough testing of the automation process before its rollout to production.</p><p>I personally gained a lot of knowledge during this project, which I gave back to my team and domain by creating documentation and hosting presentations to deep-dive into the technical aspects and challenges. Finally, working on this project greatly enhanced my proficiency with AWS, which ultimately led me to successfully acquire the AWS Solutions Architect Associate certification!</p><h4><strong>Summary</strong></h4><p>In conclusion, leveraging AWS Step Functions for automating tasks is highly beneficial for those who manage system components within AWS accounts that require regular execution or need to follow a specific sequence. This blog post has demonstrated the functionality of AWS Step Functions through a straightforward example. Yet it is possible to architect more complex workflows: invoke other AWS resources, run processes in parallel, incorporate error-handling strategies, and much more.</p><p>A big thanks to all contributors and to my team for their extensive support, providing me the opportunity to drive this project!</p><p><em>Did you enjoy this post?
Follow Klarna Engineering on </em><a href="https://engineering.klarna.com/"><em>Medium</em></a><em> and </em><a href="https://www.linkedin.com/showcase/klarna-engineering/"><em>LinkedIn</em></a><em> to stay updated on more articles like this.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=346ce7094278" width="1" height="1" alt=""><hr><p><a href="https://engineering.klarna.com/automating-the-klarna-card-ownership-fees-system-using-aws-step-functions-346ce7094278">Automating the Klarna Card Ownership Fees System using AWS Step Functions</a> was originally published in <a href="https://engineering.klarna.com">Klarna Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Peak Season 2023: How Klarna achieved consistent success]]></title>
            <link>https://engineering.klarna.com/peak-season-2023-how-klarna-achieved-success-consistently-53f673d2d58d?source=rss----86090d14ab52---4</link>
            <guid isPermaLink="false">https://medium.com/p/53f673d2d58d</guid>
            <dc:creator><![CDATA[Anu Sasidharan]]></dc:creator>
            <pubDate>Fri, 01 Mar 2024 08:52:56 GMT</pubDate>
            <atom:updated>2024-03-01T08:55:36.339Z</atom:updated>
            <content:encoded><![CDATA[<h3>Peak Season 2023: How Klarna achieved consistent success</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xpDm6nPYCFXzuHnuq0uNHQ.jpeg" /></figure><h3>Introduction</h3><p>This article summarizes how Klarna consistently achieved its Peak Season goals during 2023.</p><p><strong>In 2023</strong>, we aimed higher. We were committed to continual improvement, building on the progress made in <a href="https://engineering.klarna.com/black-friday-at-klarna-how-engineering-teams-got-ready-for-the-most-important-time-of-the-year-d7700137c64a">2022</a>. Our primary focus was to deliver the best possible experiences to our customers. During the Black Friday sale, we broke our own records, a testament to our commitment and capability.</p><p>1. Klarna’s systems showed resilience with zero critical or major incidents, and we saw a 30% reduction in overall incidents compared to 2022.</p><p>2. We optimized the management of resources, leading to optimal cloud costs.</p><p>3. We paid special attention to our engineers’ experience, ensuring a smooth Peak Season preparation. This improved efficiency and lessened the workload.</p><p>The topics this article covers:</p><ul><li>Peak Season at Klarna</li><li>Factors contributing to our success</li><li>Approaches to Peak Season essentials</li><li>Lessons learned</li></ul><h3>Peak Season at Klarna</h3><p>The most important time of the year for Klarna is the ‘Peak Season.’ It starts with the busy week of Black Friday and ends with the sales at the end of the year. Peak Season gets busy because of holidays and festivals, sales and discounts, and seasonal necessities. During this time, e-commerce and fintech companies buzz with heightened activity, with Klarna being a significant player amidst them.</p><h4>Why is the Peak Season important for Klarna?</h4><p>Klarna’s mission is to give shoppers around the world easy, safe, and ‘smoooth’ ways to pay.
We handle an average of over 2 million purchases every day, serving over 150 million active shoppers at more than 450,000 sellers in 45 countries. On Black Friday, we handle more than 3 times the usual daily purchases.</p><p>Big sales events, especially flash sales, mean our systems have to manage a lot of traffic. Flash sales are quick discounts or promotions from stores that get buyers excited. This rush adds to the already busy season, and our systems have to quickly handle more than 8 times the usual activity. We have even seen this go up to 40 times for a big merchant, all while still providing customers with a delightful experience.</p><h3>What were the key challenges in getting ready for the Peak Season?</h3><p>Klarna is a large organization with many teams working in different domains and functions. For the Peak Season, all systems directly impacting the shopping frenzy, as well as the systems supporting these frontline systems, were involved. In all, over 450 systems, 180 teams, and 17 domains prepared for the Peak Season.</p><p>The challenges were:</p><p>1. Making sure the 450+ systems were ready to manage the increased loads smooothly, especially during flash sales,</p><p>2. Ensuring the security of these systems from potential attacks,</p><p>3. Aligning key decisions and strategies across the teams,</p><p>4. Communicating decisions, timelines, processes, and best practices effectively.</p><h3>Key Success factors</h3><p>Klarna’s decentralized structure lets teams work together and make decisions quickly, helping respond to the market’s needs. This focus on innovation and customer needs ensures smoooth shopping experiences.</p><p>For Peak Season management, we have implemented an efficient organizational structure designed for our distributed environment, with clearly defined roles and responsibilities.
Good teamwork, particularly central coordination, was integral to our success during the 2023 Peak Season.</p><ul><li><strong>Teams/System Owners</strong>, being the key players, maintained their systems to meet the required standards.</li><li>The <strong>Continuous Readiness Team</strong> centrally coordinated the necessary preparations by making appropriate decisions and developing effective tools, processes, and practices to ensure operational readiness.</li><li>The <strong>Steering Committee</strong>, which includes Yaron Shaer (our CTO), Domain Leaders, and Architects, oversaw and approved the Continuous Readiness Team’s decisions.</li><li><strong>Domain Readiness Leads</strong> took the lead in their areas and provided useful feedback to the Readiness Team, helping to identify and handle potential issues early.</li><li><strong>Business Developers</strong> worked closely with merchants to provide important flash sale details.</li></ul><p>Klarna’s Engineering Platform laid a solid foundation for the System Owners to ensure optimal performance of their systems. In 2023, we focused on effectively distributing responsibilities across team, domain, and central levels, designing Peak Season requirements, and creating tools for compliance checks.</p><p>We specified two types of readiness configurations: non-negotiable requirements, which are measured objectively, and team-managed configurations, which are evaluated subjectively under the oversight of the respective domain.</p><ul><li>We implemented <strong>a readiness tool</strong> that automatically checked non-negotiable compliance requirements. The tool alerted System Owners, helping them maintain alignment with Peak Season and continuous requirements. It employed 56 rules examining system readiness across categories such as resilience, availability, databases, performance, and system capacity. Readiness dashboards enabled the Central Team to monitor and ensure system alignment. 
The tool assigned readiness scores to systems, teams, groups, domains, and Klarna as a whole. Our goal was to achieve a 100% score by October 31, 2023, thus ensuring readiness while allowing for unplanned contingencies.</li><li><strong>Readiness reviews</strong> were conducted to prevent overlooking any aspect, especially subjective assessments. System Owners underwent a comprehensive checklist review, which domain architects approved. Critical systems, which form the backbone of the purchase flow, were reviewed by a central team of architects and engineering leaders.</li><li>Under the oversight of experts from our cloud providers, we conducted <strong>DDOS fire drills</strong> on publicly exposed systems to identify potential vulnerabilities that could lead to attacks during peak times.</li></ul><p>We’ve empowered our teams in the following ways:</p><ul><li><strong>Traffic Capacity Predictions</strong>: We introduced Kapacity (<strong>K</strong>larna C<strong>apacity</strong>), our in-house tool, to help teams project minute-by-minute request volumes using past data and merchant predictions. Kapacity provides growth metrics and extra capacity for unexpected increases, allowing System Owners to easily access data, estimate incoming requests, and make informed decisions about resource allocation. This resolves a pain point from 2022, when more than half of our systems had to make independent predictions based on central metrics. Now, with Kapacity, we offer predictions for every system and service.</li><li><strong>Performance Testing Tools &amp; Framework</strong>: Customized to Klarna’s engineering needs, our centralized performance testing framework streamlines the development, building, and execution of performance tests, ensuring our services can handle heavy loads. System Owners are guided by comprehensive best practices to guarantee a consistent user experience and confirm that their test parameters fulfill well-defined SLO requirements. 
The tool can also monitor the success of tests conducted by each system.</li><li><strong>Best Practices &amp; Guidelines</strong>: We provide guidance in several areas including Lambda Readiness, Databases, handling dependencies (internal and third parties), capacity reservations, monitoring, observability, runbooks, etc.</li></ul><h3>A Closer Look at Peak Season Essentials</h3><p>If you are keen to delve deeper into the approaches behind the key Peak Season preparations, this section is designed with you in mind. Some of these topics have been touched upon in previous sections, but here we dive into the nitty-gritty details.</p><p>To guarantee a seamless and delightful customer experience during the 2023 Peak Season, preparations and detailed planning were undertaken, focusing primarily on the following essential areas:</p><p>1. Comprehensive Performance Testing</p><p>2. Proficient Management of Flash Sales</p><p>3. DDOS Readiness</p><p><strong>Approach for Performance Testing</strong>: Understanding the importance of system efficiency in handling elevated traffic, particularly during flash sales, extensive performance testing was carried out in alignment with Klarna’s specific requirements. Based on direct traffic dependencies, the distinctive nature of flash sales, and trends observed from historical data, Klarna’s infrastructure capacity was segregated into two levels:</p><ul><li><strong>FLAS Capacity</strong> (Fast, Large Spike): Designed to accommodate sudden, substantial surges of activity often triggered by flash sales, campaigns, or incidents. 
In such scenarios, traffic can spike drastically within two minutes, necessitating a system robust enough to manage these increments without relying on auto-scaling, which might be too slow to react.</li><li><strong>Baseline Capacity</strong>: This constitutes the maximum capacity needed to support the regular daily traffic (excluding FLAS events). It’s structured to comfortably endure peak daily volumes, with auto-scaling functionality for consistent performance.</li></ul><p>The systems anticipated to face FLAS events underwent <strong>load tests</strong> (where performance is gauged against expected loads), <strong>spike tests</strong> (assessing the system’s handling of sudden load increments, typically due to flash sales), and <strong>overload tests</strong> (designed to ascertain the load point at which a system fails or exhibits significant degradation). Conversely, the Baseline systems only needed to conduct load and overload tests.</p><p>Moreover, it’s integral to test underlying components like dependencies and databases — this ensures comprehensive system performance.</p><ul><li><strong>Failover tests</strong> play a critical role too. Failover is an automatic switch from the primary system to a backup system, initiated when a fault or failure is detected. Swiftly configuring databases and their corresponding clients for rapid failover is crucial for maintaining system resilience and overall performance, especially during unexpected events. For instance, if a sudden traffic surge occurs and the database isn’t primed for quick failover, notable slowdowns may occur, or the system might become completely unresponsive. Such a scenario could culminate in downtime, potentially compromising data integrity or causing data loss if handled incorrectly. 
Equipping your configuration with fast failover is thus a priority to avoid such disruptions, further ensuring a seamless user experience even under significant loads. We employed strategic retries, exponential backoffs, and robust exception handling. Additionally, we set explicit query timeouts to maintain optimal speed for both indexed and non-indexed lookups. These measures all contribute to smoother recovery of database operations, thereby preserving application performance integrity.</li></ul><p><strong>Approach for Managing Flash Sales</strong>: Flash sales management is divided into two categories: Managed Flash Sales and Unmanaged Flash Sales.</p><ul><li><strong>Managed Flash Sales</strong> are reported through a dedicated process by a Business Developer, who has direct contact with the Merchant. This report triggers an automated notification to the Continuous Readiness team. The team subsequently reviews the merchant’s historical data and projected peak capacity. The information is then compared against the existing rate limit for the relevant merchant category. This evaluation assists in managing intense sales activities during periods when partners are anticipated to exceed their standard rate limit. This vital procedure protects the Klarna platform from potential overloads that could negatively impact all partners. If a rate limit adjustment is needed due to a flash sale, the Continuous Readiness team requests a temporary rate limit increase from a central team that manages rate limits for all merchants. This decision is based on Klarna’s set capacity levels and a comprehensive understanding of the situation. Impacted teams, including Accountable Leads and On-Call members, are subsequently notified. The Business Developer or the Key Account Manager who reported the Flash Sale, along with the merchant’s solution engineer, is also informed. 
All pertinent stakeholders join a direct message group where they receive further details about the flash sale, such as the schedule, impacted countries, expected peak times, and other important considerations. Significant flash sales are monitored centrally to provide continuous support during the event.</li><li>On the other hand, <strong>Unmanaged Flash Sales</strong> are unpredictable. This means that any sudden load during these sales is managed by the FLAS capacity predictions provided by the Kapacity (prediction tool), which also includes a safety buffer. During peak season, particularly on Black Friday, continuous central monitoring is in place to ensure prompt support should any issues arise. To further ensure operational continuity, Technical Managers from our cloud providers were also on standby for support.</li></ul><p><strong>Approach for DDOS readiness</strong> : Our procedure for ensuring DDOS readiness involved enabling central platform level protection around several aspects in addition to the fire drill conducted for identified public endpoints. This exercise was led by Case Taintor (Competence Group Lead of Klarna Engineering Platform), in collaboration with our networking, security teams and cloud providers. The fire drill was planned with the aim of boosting confidence, identifying areas for improvement, validating processes, and gaining practice.</p><p>The fire drill exercise included a simulated synthetic test, reviewing setups such as WAF rules, and providing targeted advice. This drill aided system owners in identifying necessary actions on CloudFront configurations, alerting mechanisms, and origin setups. 
Moreover, it helped update essential checklists and procedures in the Runbook.</p><h3>Lessons Learned: Aspiring for Continual Growth</h3><p>One of the leadership principles that resonates profoundly with me is ‘start small and learn fast.’ At Klarna, we dare to experiment with cutting-edge technology trends, which accelerates our learning and fosters innovation. The accomplishment of each Peak Season is a tribute to the collaborative efforts of all the teams within the readiness scope. Our most recent Peak Season has set a new standard, thanks to its unprecedentedly high-quality delivery.</p><p><strong>In the future</strong>,</p><ul><li>We aim to further enhance our engineers’ efficiency by reducing the effort needed for Peak Season preparations and other operations throughout the year.</li><li>We will emphasize customer delight as the cornerstone of Klarna, rooted in our customer-centric philosophy. Therefore, persistent performance testing is of utmost importance, ensuring the reliability and optimum performance of Klarna’s systems. We currently boast a suite of impressive tools, and our goal is to foster a culture that encourages performance-driven development.</li><li>As the premier AI-powered bank, we are committed to leveraging AI to enhance efficiency and maintain an unwavering focus on quality.</li><li>We also plan to strengthen our dependency management with third-party systems to further augment readiness.</li></ul><p><strong>In conclusion</strong>, Klarna’s Peak Season 2023 was a game-changer, setting new standards in planning, preparedness, and execution. Excellence in Peak Season is no longer a goal but a standard that Klarna strives to elevate with each passing year, promising a future of seamless shopping experiences for our customers. 
In the face of ever-evolving challenges, Klarna’s steadfast commitment to customer delight, championed by a spirit of continual growth and innovation, continues to guide us towards stellar trajectories of success.</p><p><em>Did you enjoy this post and want to stay updated on our latest projects and advancements in the engineering field? Join the Klarna Engineering community on </em><a href="https://engineering.klarna.com/"><em>Medium</em></a>, <a href="https://www.meetup.com/klarna-engineering-stockholm/"><em>Meetup.com</em></a> <em>and </em><a href="https://www.linkedin.com/showcase/klarna-engineering/"><em>LinkedIn</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=53f673d2d58d" width="1" height="1" alt=""><hr><p><a href="https://engineering.klarna.com/peak-season-2023-how-klarna-achieved-success-consistently-53f673d2d58d">Peak Season 2023 : How Klarna achieved consistent success</a> was originally published in <a href="https://engineering.klarna.com">Klarna Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Stop Misusing ROC Curve and GINI: Navigate Imbalanced Datasets with Confidence]]></title>
            <link>https://engineering.klarna.com/stop-misusing-roc-curve-and-gini-navigate-imbalanced-datasets-with-confidence-5edec4c187d7?source=rss----86090d14ab52---4</link>
            <guid isPermaLink="false">https://medium.com/p/5edec4c187d7</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[imbalanced-data]]></category>
            <dc:creator><![CDATA[Angel Igareta]]></dc:creator>
            <pubDate>Thu, 09 Nov 2023 09:22:53 GMT</pubDate>
            <atom:updated>2023-11-09T09:22:53.046Z</atom:updated>
            <content:encoded><![CDATA[<h4>Discover how the Precision-Recall curve can provide a more robust metric for binary classification in data science and machine learning.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5ARZVGrXbPENwcSBjPRavw.jpeg" /></figure><p><strong>Imagine stepping into the complex world of binary classification problems. As a Senior Data Scientist at </strong><a href="https://medium.com/u/a214eb632ed5"><strong>Klarna</strong></a><strong>, this is my day-to-day reality. Binary classification is a cornerstone of data science, with applications touching everything from credit default predictions to medical diagnoses and spam detection. Yet, these problems come with their own unique set of challenges.</strong></p><p>Metrics such as the GINI coefficient and ROC_AUC often serve as our compass in this maze. They are widely trusted and used for evaluating models. But here’s the catch: they might not always point us in the right direction. Can we rely on them blindly, or do we need to dig deeper?</p><p>The path gets even more challenging when we encounter imbalanced datasets. In such cases, the effectiveness of our trusted metrics can be seriously compromised.</p><p>In this post, I invite you to join me on a journey to explore these metrics in greater depth. We will question their effectiveness, understand their limitations, and reveal alternatives that could prove to be more reliable navigational tools in the world of binary classification problems.</p><h3>Understanding Model Predictions and Metrics</h3><p>To truly grasp the nuances of model evaluation, let’s start by setting the stage with a real-world scenario that we often encounter at Klarna.</p><p>Imagine we’re tasked with predicting customer loan defaults. We have two categories to consider — paid or default. However, in our scenario, the default rate is a mere 2%. 
This is a classic case of data imbalance, and it’s exactly the kind of challenge we’re up against.</p><p>To evaluate our model’s performance in this scenario, we need to understand its predictions. We break these down into four distinct outcomes, also known as the confusion matrix:</p><ol><li><strong>True Positives (TP):</strong> These are the customers who our model correctly identifies as defaulters.</li><li><strong>False Positives (FP):</strong> These are the customers who our model incorrectly flags as defaulters. Such errors can result in losses of customer lifetime value.</li><li><strong>True Negatives (TN):</strong> These are the customers who our model correctly identifies as non-defaulters.</li><li><strong>False Negatives (FN):</strong> These are the customers who our model incorrectly flags as non-defaulters. Such errors can lead to financial losses due to the average loss and recovery rate.</li></ol><p>With these categories in place, we can delve into the metrics we use to measure our model’s performance. In real-world applications, Data Scientists often aim to provide the best model within certain constraints, such as a range of risk profiles (like default rates). They don’t fixate on a single threshold. The Underwriting teams then pick a threshold based on the company’s current risk appetite and objectives over a given period.</p><p>Hence, we’ll skip the popular measures like Accuracy or F1-Score and instead we’ll focus on <strong>threshold-agnostic metrics</strong>, which gauge the model’s ability to distinguish between categories, regardless of the chosen threshold.</p><h4>ROC Curve</h4><p>Let’s delve deeper into one of the most widely used threshold-agnostic metrics — ROC_AUC, which stands for Receiver Operating Characteristic Area Under the Curve. 
This metric uses a graphical representation to provide a comprehensive view of our model’s predictive capabilities.</p><p>The ROC curve plots True Positive Rate (TPR) on the y-axis and False Positive Rate (FPR) on the x-axis. This gives us a clear view of the trade-off between TPR and FPR at different classification thresholds, offering a holistic understanding of our model’s performance across varying cutoff points.</p><figure><img alt="This graphical representation details an example of a Receiver Operating Characteristic Area Under the Curve (ROC_AUC). The graph plots the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis. The graph features several curves, each representing a different classifier’s performance. The ‘random classifier’ curve, which represents a baseline model, is also highlighted. The curves of better performing models are closer to the top left corner." src="https://cdn-images-1.medium.com/max/1024/0*ojQ6_b8E8vDaK4eV" /><figcaption>ROC Curve Illustration: Comparing Classifier Performances from Best to Worst</figcaption></figure><p>But what exactly are TPR and FPR? Let’s break it down:</p><ol><li><strong>True Positive Rate (TPR):</strong> This is the proportion of actual defaulters that our model correctly identifies. Mathematically, TPR is calculated as TP / (TP + FN). Another way to think about it is, “If a customer defaults, what’s the chance our model will catch it?”</li><li><strong>False Positive Rate (FPR):</strong> This is the proportion of actual non-defaulters that our model mistakenly identifies as defaulters. Mathematically, FPR is calculated as FP / (FP + TN). You can interpret it as, “If a customer didn’t default, what’s the likelihood our model incorrectly marks them as a defaulter?”</li></ol><p>The area under the ROC curve gives us the ROC_AUC score. 
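As a rough sketch of how these quantities are computed in practice (assuming scikit-learn, which the article does not itself use), ROC_AUC and the derived GINI coefficient can be obtained on a synthetic imbalanced sample:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic dataset with a ~2% positive (default) rate, mirroring the article's scenario.
X, y = make_classification(n_samples=20_000, n_features=10, weights=[0.98], random_state=0)
model = LogisticRegression(max_iter=1_000).fit(X, y)
scores = model.predict_proba(X)[:, 1]

# The ROC curve: TPR vs. FPR over all classification thresholds.
fpr, tpr, thresholds = roc_curve(y, scores)

roc_auc = roc_auc_score(y, scores)
gini = 2 * roc_auc - 1  # GINI is a direct rescaling of ROC_AUC
print(f"ROC_AUC = {roc_auc:.3f}, GINI = {gini:.3f}")
```

Plotting `tpr` against `fpr` reproduces the curve in the figure above; the variable names are illustrative, but `roc_curve` and `roc_auc_score` are the standard scikit-learn entry points.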
A higher score indicates that our model is not only accurate in its predictions (high TPR) but also minimizes false alarms (low FPR) across all classification thresholds.</p><h4>GINI Coefficient</h4><p>Derived from ROC_AUC, the GINI coefficient provides a simple yet insightful measure of a model’s performance. It ranges from 0, which signifies a model with no discriminative power, to 1, indicating perfect discrimination between classes. The formula for calculation is: GINI = 2 * ROC_AUC - 1.</p><p>For stakeholders, the GINI coefficient offers a quick, single-figure snapshot of model effectiveness. Unlike the ROC_AUC, which requires a deeper understanding of true and false positives and negatives, the GINI coefficient provides a more immediate sense of model performance, making it a popular choice for non-technical audiences.</p><blockquote>However, both the ROC_AUC and the GINI coefficient share a common pitfall.</blockquote><p>While they excel in evaluating models for balanced datasets, they can be misleading when dealing with imbalanced datasets. This is because they don’t take into account the ratio of the positive and negative classes.</p><p>Consequently, these metrics can yield a high score even when the model is performing poorly on the minority class. In an unbalanced situation, a model may resort to always predicting the majority class to achieve a high score. This strategy, though it inflates the performance metrics, fails to provide any meaningful insight into the minority class, which could be critical in scenarios like fraud detection or rare disease diagnosis where the minority class is of greater interest.</p><p>Yet, there’s no need for concern. We have a remedy for this situation: Enter PR_AUC. This metric provides a more reliable evaluation for imbalanced datasets, which we’ll explore next.</p><h4>Precision Recall Curve</h4><p>Let’s delve into PR_AUC (Precision-Recall Area Under the Curve). 
Like the ROC curve, it’s a graphical representation of model performance, but it uses Precision and Recall.</p><p>Let’s break down these two components:</p><ol><li><strong>Precision:</strong> The ratio of true positive outcomes (correctly identified defaulters in our case) to all positive outcomes predicted by the model. It’s calculated as TP / (TP + FP). You can view it as, “When our model flags a customer as a defaulter, what’s the probability they really are?”</li><li><strong>Recall (or True Positive Rate):</strong> This is the same quantity we discussed in the context of the ROC curve: the proportion of actual defaulters correctly identified by the model. It’s the answer to, “Out of all the customers who defaulted, what portion did our model manage to identify?”</li></ol><p>The PR curve, with Precision on the y-axis and Recall on the x-axis, illustrates the trade-off between these two metrics for different threshold values, akin to the ROC curve’s TPR and FPR trade-off view.</p><figure><img alt="This graphical representation details an example of a Precision Recall Area Under the Curve (PR_AUC). The graph plots the Precision on the y-axis against the Recall (True Positive Rate) on the x-axis. The graph features several curves, each representing a different classifier’s performance. The ‘random classifier’ curve, which represents a baseline model, is also highlighted. The curves of better performing models are closer to the top right corner." src="https://cdn-images-1.medium.com/max/1024/0*1z69voTBb04MIzig" /><figcaption>PR Curve Illustration with a target incidence of 0.1: Comparing Classifier Performances from Best to Worst.</figcaption></figure><p>The PR_AUC score, derived from the area under the PR curve, indicates the model’s accuracy (high precision) and its ability to detect a large portion of actual defaulters (high recall). 
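To make the contrast concrete, here is a minimal sketch (assuming scikit-learn; not code from the article) showing that a completely uninformative model scores a PR_AUC near the positive-class incidence, not near 0.5:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.02).astype(int)  # ~2% positive (default) incidence
random_scores = rng.random(10_000)                # scores from an uninformative model

# Precision and recall over all classification thresholds.
precision, recall, thresholds = precision_recall_curve(y_true, random_scores)

# average_precision_score is the standard summary of the area under the PR curve.
pr_auc = average_precision_score(y_true, random_scores)
print(f"PR_AUC of a random model: {pr_auc:.3f} (close to the 2% incidence)")
```

This is why the random-classifier baseline in the PR figure above sits at the target incidence rather than at the 0.5 familiar from ROC plots.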
However, the score’s dependence on the target incidence (proportion of positive cases in the dataset) may make it less intuitive for those who are not well-versed in model evaluation.</p><blockquote>The PR_AUC score offers a detailed insight into the model’s performance, especially on imbalanced datasets, proving to be an invaluable metric for data scientists and analysts.</blockquote><h4>Difference Between ROC_AUC and PR_AUC</h4><p>Even though ROC_AUC and PR_AUC both serve as powerful tools for visualizing a model’s performance, they differ in their underlying metrics and how they interpret a model’s effectiveness.</p><p>Here’s a brief refresher on the metrics used in both curves:</p><p><strong>ROC_AUC</strong></p><ul><li><strong>True Positive Rate (TPR):</strong> TP / (TP + FN)</li><li><strong>False Positive Rate (FPR):</strong> FP / (FP + TN)</li></ul><p><strong>PR_AUC</strong></p><ul><li><strong>Precision:</strong> TP / (TP + FP)</li><li><strong>Recall:</strong> TP / (TP + FN)</li></ul><p>The impact of imbalanced datasets becomes critical when the negative class significantly outnumbers the positive class, leading to an abundance of TNs. This imbalance can distort metrics like ROC_AUC, which incorporates the FPR in its calculation.</p><blockquote>A high number of TNs can result in a misleadingly low FPR, even with many False Positives, thereby inflating the ROC_AUC score and painting an overly optimistic picture of the model’s performance.</blockquote><p>PR_AUC, unlike ROC_AUC, doesn’t consider <strong>TN</strong>. 
It focuses on the model’s precision and recall, providing a more realistic evaluation of performance on imbalanced datasets by concentrating on the minority class.</p><p>As we’ve unraveled these two powerful metrics, you’re now equipped with the knowledge to discern which one is best suited for your unique dataset and business problem.</p><p>Next, we’ll apply these concepts to a practical problem, bringing these metrics to life with real-world data.</p><h3>Practical Illustration: The Tale of Two Models</h3><p>To bring these concepts to life, let’s consider two models trained on the same imbalanced dataset. We have two contenders in the ring:</p><ul><li><strong>Model A:</strong> A Logistic Regression model, simple yet effective.</li><li><strong>Model B:</strong> A Gradient Boosting model, known for its precision and handling of complex datasets.</li></ul><p>Both models were trained on a dataset with a 2% default rate, a classic case of imbalance, and tested on a set of 10,000 customers.</p><p>As explained before, performance metrics are usually calculated by integrating over all possible thresholds, not just a single point. 
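Before walking through the numbers, the fixed-threshold quantities used in the comparison below can be computed from a single confusion matrix; `point_metrics` is a hypothetical helper name, not something from the article:

```python
def point_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Rates at one fixed threshold, from one confusion matrix."""
    return {
        "tpr": tp / (tp + fn),        # recall / true positive rate
        "fpr": fp / (fp + tn),        # false positive rate
        "precision": tp / (tp + fp),  # precision
    }

# Model A's confusion matrix from the illustration that follows:
m = point_metrics(tp=100, fp=200, tn=9600, fn=100)
print(m)  # tpr = 0.5, fpr ~ 0.0204, precision ~ 0.3333
```

Plugging in Model B’s matrix (tp=100, fp=100, tn=9700, fn=100) reproduces the second set of numbers.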
However, for simplicity in this illustration, we will consider the errors at a fixed threshold and simplified ROC_AUC and PR_AUC calculations.</p><p>Following the evaluation of the models, the performance of each can be summarized as follows:</p><p><strong>Model A’s Performance Card</strong></p><pre>| Confusion Matrix      | Predicted Non-Defaulters | Predicted Defaulters |<br>|-----------------------|:------------------------:|:--------------------:|<br>| Actual Non-Defaulters |           9600 (TN)      |         200 (FP)     |<br>| Actual Defaulters     |           100 (FN)       |         100 (TP)     |</pre><p><strong>Model B’s Performance Card</strong></p><pre>| Confusion Matrix      | Predicted Non-Defaulters | Predicted Defaulters |<br>|-----------------------|:------------------------:|:--------------------:|<br>| Actual Non-Defaulters |           9700 (TN)      |         100 (FP)     |<br>| Actual Defaulters     |           100 (FN)       |         100 (TP)     |</pre><p>We now apply the simplified formulas for the ROC and PR areas under the curve for both models (assuming just one data point).</p><p><strong>Model A’s Performance Metrics</strong></p><ul><li><strong>ROC_AUC</strong>: Approximately 49%, calculated as TPR * (1 - FPR) = 0.5 * (1 - 0.0204) = 0.4898.</li><li><strong>PR_AUC</strong>: Approximately 17%, calculated as Precision * Recall = 0.3333 * 0.5 ≈ 0.1667.</li></ul><p><strong>Model B’s Performance Metrics</strong></p><ul><li><strong>ROC_AUC</strong>: Approximately 49%, calculated as TPR * (1 - FPR) = 0.5 * (1 - 0.0102) = 0.4949.</li><li><strong>PR_AUC</strong>: Approximately 25%, calculated as Precision * Recall = 0.5 * 0.5 = 0.25.</li></ul><h4>Comparison</h4><p>When comparing both models, if we were to just look at the ROC_AUC scores, which are 49% for both Model A and Model B, we might erroneously conclude that both models perform identically.</p><p>However, the ROC_AUC metric, with its high sensitivity to True Negatives, can 
distort our understanding, especially when dealing with imbalanced datasets like ours. Neglecting this subtlety could result in rejecting good customers, consequently missing out on their potential lifetime value.</p><blockquote>Despite Model A incorrectly predicting 100 more customers as defaulters than Model B, the identical ROC_AUC scores would suggest equivalent performance.</blockquote><p>This is where the PR_AUC metric comes into play. With scores of 17% for Model A and 25% for Model B, this metric, with its emphasis on False Positives, unveils a different scenario — Model B outperforms Model A in correctly classifying the minority class.</p><h3>Conclusions</h3><p>In this post, we explored the <strong>complexities of model evaluation for binary classification problems with imbalanced datasets</strong>. We discussed the limitations of popular metrics like ROC_AUC and GINI, which can be misleading in such scenarios, and introduced <strong>PR_AUC as a more reliable alternative</strong>.</p><p>Through a practical example, we demonstrated how metric choice impacts the interpretation of model performance. We highlighted the importance of selecting a metric that aligns with your dataset and problem to ensure accurate model evaluation.</p><p>In conclusion, the key takeaway is the <strong>importance of understanding your data and the problem at hand</strong>. Choosing the right metric that aligns with your dataset and problem can lead to more accurate measurements of your model’s performance and provide meaningful insights. Remember, there’s no one-size-fits-all solution in data science. 
It’s all about finding the right tool for the job.</p><p><strong><em>Angel Igareta,</em></strong><em> Senior Data Scientist <br></em><a href="https://medium.com/@angeligareta">Medium</a> | <a href="https://www.linkedin.com/in/angeligareta/">LinkedIn</a> | <a href="https://github.com/angeligareta">GitHub</a></p><p><em>Did you enjoy this post and want to stay updated on our latest projects and advancements in the engineering field? Join the Klarna Engineering community on </em><a href="https://engineering.klarna.com/"><em>Medium</em></a>, <a href="https://www.meetup.com/klarna-engineering-stockholm/"><em>Meetup.com</em></a> <em>and </em><a href="https://www.linkedin.com/showcase/klarna-engineering/"><em>LinkedIn</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5edec4c187d7" width="1" height="1" alt=""><hr><p><a href="https://engineering.klarna.com/stop-misusing-roc-curve-and-gini-navigate-imbalanced-datasets-with-confidence-5edec4c187d7">Stop Misusing ROC Curve and GINI: Navigate Imbalanced Datasets with Confidence</a> was originally published in <a href="https://engineering.klarna.com">Klarna Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Overcoming the Hurdle of Unformatted Input: What I Learned From Building a ChatGPT Add-On for…]]></title>
            <link>https://engineering.klarna.com/building-a-chatgpt-add-on-my-journey-to-streamlined-communication-in-google-workspace-60e73ec00084?source=rss----86090d14ab52---4</link>
            <guid isPermaLink="false">https://medium.com/p/60e73ec00084</guid>
            <category><![CDATA[productivity]]></category>
            <category><![CDATA[langchain]]></category>
            <category><![CDATA[google-workspace]]></category>
            <category><![CDATA[javascript]]></category>
            <category><![CDATA[chatgpt]]></category>
            <dc:creator><![CDATA[Mikael Wulfcrona]]></dc:creator>
            <pubDate>Mon, 16 Oct 2023 09:50:38 GMT</pubDate>
            <atom:updated>2023-10-16T10:12:04.027Z</atom:updated>
            <content:encoded><![CDATA[<h3><strong>Overcoming the Hurdle of Unformatted Input: What I Learned From Building a ChatGPT Add-On for Google Workspace</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BFWrDMBi1DomSGQzkI9bVQ.jpeg" /></figure><p><strong>While ChatGPT is an incredibly powerful language model, it still struggles with unformatted input</strong>. <strong>As a tech nerd and passionate data scientist, I recently embarked on a journey to create an add-on for Google Workspace that could seamlessly interact with data stored in Sheets. Follow along as I share some of the main challenges, learnings, and solutions that eventually allowed me to integrate ChatGPT into Google’s productivity tools for textual interactions.</strong></p><p><strong>Unleashing the Potential of ChatGPT</strong><br>I was one of the many people who were very impressed with the early versions of ChatGPT, released at the end of last year. Its capability to generate human-like responses fascinated me. However, I struggled to integrate it into my daily work. Having to switch tabs and copy-paste text between documents kept me from really adopting this new tool.</p><p>Determined to overcome this hurdle, I dove into the world of Google Workspace add-ons, where the work with text actually happens. Armed with enthusiasm and basic JavaScript knowledge, I set about building my first add-on. Having never built one before, I forked Google’s own example repo and started working.</p><p><strong>Working with the Google API<br></strong>As I delved into the project, I realized integrating ChatGPT via API calls was relatively straightforward. OpenAI provided excellent documentation, and Google Apps Script offered built-in features to make API calls. The main challenges actually lay on the Google side. Extracting text from different apps within the Google Workspace ecosystem initially baffled me. 
I spent hours unraveling the intricacies of the Google API, determined to find a way to seamlessly access and manipulate data.</p><p><strong>Overcoming GPT’s Limitations with Raw Data in Google Sheets</strong><br>As I progressed, I encountered more roadblocks. First, the infamous token limit: some documents were quite long and simply impossible to fit into ChatGPT. While there are great workarounds for this, deploying third-party libraries to the add-on codebase proved difficult. This was further complicated by a hard 60-second execution time limit for Google add-ons.</p><p>Second, and maybe the toughest one: ChatGPT struggles when faced with raw data from Google Sheets (CSV), as it requires formatted input to generate meaningful responses. This posed a significant obstacle to building an add-on that could interact with data stored in Sheets.</p><p>Undeterred, I devised a two-step solution. First, I created a data preprocessing module within the add-on. This allowed me to parse and transform raw data from Google Sheets into a format compatible with ChatGPT. I extracted the relevant information, structured the data, and generated prompts that ChatGPT could understand. Second, I developed an output formatting module that presented ChatGPT’s responses in a user-friendly manner, right within the familiar Google Sheets interface.</p><p><strong>Conclusion</strong><br>My journey to build a ChatGPT add-on for Google Workspace proved to be a rewarding endeavour. Overcoming the different challenges associated with the Google API and ChatGPT’s limitations made me realize that we are only scratching the surface of what’s possible with these cutting-edge AI models. Today, fortunately, the add-on has been released internally at Klarna to help my colleagues enhance their productivity within the Google Workspace ecosystem.</p><p>This is where the future of digital communication lies. 
I’m sure that with determination and creativity, anyone can embark on similar projects and make a positive impact on how we interact in the digital realm.</p><p><strong>What’s next? Adding LangChain to Enhance Data Processing</strong><br>But my journey doesn’t end here. To take the ChatGPT add-on to even greater heights, I’m planning to integrate LangChain, a powerful Python library. By leveraging its robust toolset, I hope to overcome the limitations I faced with raw data in Google Sheets and to solve the issue of limited tokens. This will greatly enhance data processing and user interaction within the add-on.</p><p><strong><em>Mikael Wulfcrona,</em></strong><em> Author<br></em><strong><em>ChatGPT</em></strong><em>, Co-author and editor</em></p><p><em>Did you enjoy this post and want to stay updated on our latest projects and advancements in the engineering field? Join the Klarna Engineering community on </em><a href="https://engineering.klarna.com/"><em>Medium</em></a><em> and </em><a href="https://www.linkedin.com/showcase/klarna-engineering/"><em>LinkedIn</em></a><em>.</em></p><hr><p><a href="https://engineering.klarna.com/building-a-chatgpt-add-on-my-journey-to-streamlined-communication-in-google-workspace-60e73ec00084">Overcoming the Hurdle of Unformatted Input: What I Learned From Building a ChatGPT Add-On for…</a> was originally published in <a href="https://engineering.klarna.com">Klarna Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introducing native E2E testing: Learnings from the Senior Engineering Program for Women]]></title>
            <link>https://engineering.klarna.com/introducing-native-e2e-testing-learnings-from-the-senior-engineering-program-for-women-4c49cda2122c?source=rss----86090d14ab52---4</link>
            <guid isPermaLink="false">https://medium.com/p/4c49cda2122c</guid>
            <category><![CDATA[typescript]]></category>
            <category><![CDATA[e2e-testing]]></category>
            <category><![CDATA[gender-equality]]></category>
            <category><![CDATA[klarna]]></category>
            <category><![CDATA[appium]]></category>
            <dc:creator><![CDATA[Joana Melo]]></dc:creator>
            <pubDate>Fri, 08 Sep 2023 12:35:44 GMT</pubDate>
            <atom:updated>2023-09-11T13:42:08.374Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_3PMhXpbg4yGqtO5t4VQvQ.jpeg" /></figure><p><strong>I made a company-wide impact by successfully delivering the introduction of native end-to-end (E2E) testing in mini versions of the Klarna app. The goal was to have automated feature regression tests in our pipelines. I developed this as part of a program for senior engineering women, and today I want to share the insights and learnings I gained from this experience.</strong></p><h3>Creating fair and equal opportunities for women</h3><p>How do we offer women equal and fair opportunities in an industry dominated by men?</p><p>Well, there are many ways to work on this topic. One that piqued my curiosity was Klarna’s <strong>Senior Engineering Program for Women</strong> (<strong>SEPW</strong>).</p><p>As you might wonder yourself, my initial thoughts on it, as with any other initiative like this, came with a lot of reservations:</p><blockquote>Is this fair? Is this the best way for me to ensure that I’m being fairly evaluated? Will it look like I am being put on a speedlane towards an easy promotion if I happen to get one, because I’m a woman? Are we going to get treated like tokens? Is this all just a marketing strategy? Am I taking part in and legitimizing something that has no real content and value for my career or for other women? What will everyone think?</blockquote><p>When we are invited to join initiatives that address the gender gap, we can fall into the trap of wanting all the perfect and right answers before we take risks, or we can accept that the perfectly carved, impactful and life-changing solution will never arrive at our doorstep.</p><p>We can only experiment and learn from the results to make better decisions as we help evolve toward a hopefully more gender-fair world.</p><p>As a woman in engineering, I understand the issues, but I don’t claim to have all the answers. 
And that’s ok.</p><h3>The program</h3><p>The SEPW is a way for Klarna to acknowledge and accelerate the professional development of promising engineers and to promote diversity within engineering. The six-month program is designed with the individual’s growth as the main focus, and is based on four themes: Execution, Collaboration and Communication, Technology, and Influence.</p><p>It includes coaching sessions with exercises such as presenting systems architecture and solutions. Each iteration runs with a group of 6–8 women, selected with support from the engineering leads of the department. You can find more information on this <a href="https://www.klarna.com/careers/our-competences/engineering/">here</a>.</p><h3>Going for it</h3><p>I was offered a spot on the program and, after reflecting on the questions I had, I decided to accept. The first immediately valuable step was reaching out to past participants to hear about their experience in the program; the collaboration and support started right there.</p><p>Then came figuring out how to balance the program with my team’s day-to-day work, and adjusting the goals and size of the project accordingly.</p><p>I created a list of candidate projects, defined their impact, duration and effort, and asked for feedback from key people.</p><h3>Finding a project</h3><p>From my own experience working on the frontend at Klarna, and from talking to many other engineers who do the same, one pain point that most of these teams share became clear: the testing and validation of the release candidate that goes into the app each week.</p><p>A lot of manual testing goes into it, which means testing our feature flows (mostly the same each week) on two platforms — iOS and Android. 
While we do have many tools for different types of automated testing, we were missing one that would allow real end-to-end testing in native apps: one that could truly replace much of, or all of, release testing.</p><p>There had actually been multiple attempts over the years to bring native E2E tests to the app for all teams, but for a mix of different reasons they never came to fruition.</p><p>After I discovered this opportunity, I took it upon myself to work on introducing support for E2E testing on the native platforms.</p><p>Along the way, I crossed paths and worked together with engineers who:</p><ul><li>work on the core frontend tooling, reliability and pipelines of the app</li><li>belong to other teams and departments that were also interested in tests or tooling and work closely with them, or who help out the core teams and have a lot of know-how</li><li>drive end-to-end testing as a new initiative of its own</li><li>work on something else entirely, but wanted to help test the early stages of the solution</li></ul><p>I got the big picture of the state of many of the central pieces in the app, while also spotting redundant work being done between teams, and I helped bridge those efforts.</p><p>This helped me figure out a more precise plan and goals for myself as well.</p><h3>Technical challenges and improvements</h3><p>To bring this project to life, I introduced several changes to our tooling, a few of which I will illustrate below.</p><h4>TypeScript and globals</h4><p>I discovered that, through different products included in our monorepo, a few TypeScript globals were defined, including Jest and jQuery. Since I wanted to include TypeScript support in the test definitions, and we were instead using Mocha and Appium, syntax like “expect” or “$” was colliding with the predefined globals. 
Since these globals were used in several places, it would have been a challenge in itself to migrate the projects to local definitions.</p><blockquote>A big takeaway from this is to always avoid defining globals, as libraries you may want to use in your project later can have the same syntax commands as your current ones.</blockquote><p>To address this, the first step was to add the Appium test folders to the exclusion list of the global clients <em>tsconfig</em>:</p><pre>{<br>  &quot;exclude&quot;: [&quot;(…)/__appium__/**/*&quot;] <br>}</pre><p>Then, in the Appium tooling code, I included only the local files and tests, and also had to ignore the Jest types:</p><pre>{<br>  &quot;include&quot;: [&quot;(…)/__appium__/**/*&quot;, &quot;./**/*&quot;],<br>  &quot;exclude&quot;: [&quot;node_modules/@types/jest&quot;]<br>}</pre><p>In our deprecated Appium tooling, we had centralized tests in a single folder. 
Since test collocation was also introduced (meaning each test file would live next to its feature and move around together with it), and features still needed to use the global <em>tsconfig</em>, each feature’s tests folder had to include a tsconfig extending the tooling <em>tsconfig</em> as follows:</p><pre>{<br>  &quot;extends&quot;: &quot;(…)/appium/tsconfig.json&quot;<br>}</pre><h4>Naming conventions</h4><p>I also defined a specific namespace for Appium tests:</p><pre>&quot;app/__appium__/**/*&quot;</pre><p>And, to help our tooling evaluate and discover the tests, I registered the matching file pattern in <em>.eslintrc.js</em>:</p><pre>&quot;**/*.appium-spec&quot;</pre><h4>Sub-dependency nightmare</h4><p>Since I moved the Appium tests from a separate standalone subproject, with its own configuration and local dependencies, to a central root location in the repo, and as a consequence included this tool’s dependencies in the root package.json, a few issues came up with multiple versions of some common sub-dependencies already used by the rest of the app. The project is protected against dependency redundancy, and we use <a href="https://github.com/oblador/diglett">Diglett</a> to detect it. Since several of these sub-dependencies were very common, it would have been nearly impossible to version-bump and match all of them within the scope of this project, so I added the needed entries to <em>diglettignore.txt</em>.</p><h3>The coaching</h3><p>While working on this project, the SEPW provided coaching sessions. Internal and external presenters shared their expertise, allowing me to apply this knowledge to my project and beyond. 
Additionally, we as participants had opportunities to present our progress and practice leadership soft skills, while receiving valuable feedback from coaches and fellow women in the program.</p><p>In the scope of these networking sessions, I talked to other women at my level about their work, their SEPW projects, and the various engineering, work and gender-related challenges they face. We shouldn’t underestimate the power this type of connection brings.</p><h3>Outcomes</h3><p>At the end of the program, I delivered on my set goal: working native E2E tests in feature apps, built with efficient runs and less friction in mind, for an enhanced developer experience and speedier pipelines. Ultimately, using this brings a lot more trust and robustness to any changes we make in our own and others’ features, knowing that they will not introduce a corner-case bug that we only detect the following week while doing release testing once again.</p><p>Along with all of this, I also brought test collocation into the features, so that we can scale to many teams easily and, more importantly, have a clear view of ownership and accountability. 
This has also become the reference for the same effort of co-locating our central Cypress tests, and the testing ground for the challenges of that migration, which will save the core teams precious implementation time.</p><p>I have gained valuable skills in technologies such as <a href="https://webdriver.io/">Webdriver.io</a>, <a href="http://appium.io/docs/en/2.0/">Appium</a>, <a href="https://mochajs.org/">Mocha</a>, and <a href="https://github.com/allure-framework/allure-js">Allure</a>.</p><p>And ultimately, by promoting these achievements in our Slack channels, I saw more and more people interested in, and courageous enough to, introduce these types of tests in their own features, reaching out to me for support in setting up their local environments.</p><h3>Learnings</h3><p>I was able to show what I can do, while improving important tools and my own skills along the way.</p><p>I have learned how our core tooling code is set up, and was able to contribute with both suggestions and concrete actions.</p><p>I have realized that this type of initiative offers opportunities that may not otherwise be accessible due to various factors, including personal drive, team context, and, yes, unconscious gender bias.</p><p>What is also very motivating is that, as I finished the program and my project, other women who were curious about the program, or who had already joined a new iteration of it, reached out to me asking for opinions, support and advice — reinforcing the importance of being, and having, a role model and reference as an engineer.</p><p>And of course, I am now also part of a network of fellow engineers from different parts of the company who have participated in the program and the different coaching sessions, as well as of the organizers, who are clearly invested in the topic of diversity and gender equity. 
A network with an amazing sense of belonging and shared experiences.</p><p>The visibility of my participation in the program has allowed me to promote the initiative, my project, and support for women at Klarna, and, beyond that, to contribute to making the program better and to be part of company-wide initiatives that work on closing the gender gap.</p><p>If I could give any advice to my past self, I would say: “<em>It will be challenging. It will be completely unfamiliar territory. That’s why you have to go for it!</em>”</p><p><em>Did you enjoy this post? Follow Klarna Engineering on </em><a href="https://engineering.klarna.com/"><em>Medium</em></a><em> and </em><a href="https://www.linkedin.com/showcase/klarna-engineering/"><em>LinkedIn</em></a><em> to stay updated on more articles like this.</em></p><hr><p><a href="https://engineering.klarna.com/introducing-native-e2e-testing-learnings-from-the-senior-engineering-program-for-women-4c49cda2122c">Introducing native E2E testing: Learnings from the Senior Engineering Program for Women</a> was originally published in <a href="https://engineering.klarna.com">Klarna Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>