For a while now, I have wanted to write a few words about this topic: what is the absolute worst decision you can make as a decision maker involving a tech product (internal or external)? I guess we all have our own definitions for what that "absolute worst decision" is. For some, it might be hiring the wrong person; for others, it might be accepting and becoming too comfortable with technical debt, or building with the wrong technology stack.
Although all of these are valid concerns, they can be addressed and fixed to some degree, and if not fixed, they at least can be kept to a manageable level and lived with.
The thing I have in mind is the decision to completely rewrite a tech product, be it internal or external, from the ground up, under the pretext that this time we have learned from our mistakes, will make better architectural choices, and the tech debt will be resolved.
Let me address these reasons one by one and tell you why a complete rewrite of a product is doomed to fail from the moment that decision is taken.
The first reason this is a bad idea is that we need to split the team (or teams) responsible for the current product in two: one half will maintain and continue working on the current product to keep the current clients satisfied and the business progressing, while the other half will plan, write, test and deliver the new product.
This leads to a bad situation: the existing product keeps receiving new features and bugfixes while the rewrite advances at a delayed pace. Reimplementing the features one by one takes time, and you end up with two products that fill the same role but advance in parallel at different paces. The end goal is for the rewrite to catch up with the original product, but that will probably happen years down the road (if at all). And in that time, you have basically split the tech team in two: a team which delivers value to existing customers, and a team which is working on a product that delivers nothing yet, with the hope that one day you will merge the teams once again.
Another caveat with this "split the team" issue is that the decisions that resulted in the mess you are trying to fix with a complete rewrite were made by the same people now tasked with the rewrite. And the decisions that piled up the technical debt were mostly not taken due to a lack of knowledge (perhaps only partly so; ideally, the people who worked on the product initially and accumulated all that tech debt have grown their skillset over the time they worked there).
Tech debt is a deliberate business choice: it results from tight deadlines and from prioritizing fast delivery over quality, both of which are conscious business decisions made early on. My take on this is that, as a company, you need to prioritize delivering value to customers first, while keeping tech debt at a manageable level in the background. You do this by dedicating some time during sprints to fixing areas that are becoming too messy, and by having a good set of tests that allows you to refactor with ease. After all, clients don't care how long your functions are or what their cyclomatic complexity is. They only care about what you can do for them to make their lives easier.
Another problem with a complete rewrite is the migration strategy once it's done: how do you move all your customers to the new product? Ideally, their experience stays the same, because the ones actively using (and paying for) your product already have the current UI and interactions ingrained in their muscle memory. If the rewrite is too different, you will cause a lot of frustration and get a lot of customer support pressure, because clients won't be able to find the things they used to know the exact location of.
So, here are the two options: you either move all your clients to the new platform at once in one swift migration, which results in a "big bang" release, or you migrate your clients slowly to the new platform, which is essentially a long process of onboarding everybody all over again.
The problem with big bang releases is that a lot of things change at once, and a lot of things can go wrong. With small incremental improvements, you introduce fewer new bugs, find them fast and fix them; with big releases you can introduce a lot of new bugs at once, which inevitably passed the initial testing (it is hard to cover all the real-life usage scenarios in testing, and clients tend to do the weirdest things that you, in most cases, can't anticipate).
The final problem: what guarantees that the new product will be a good replacement? The original code ended up messy because of all the requirements that poured in, changes in user behavior and company priorities, and so on. Those requirements are there, in the code, along with the list of very specific edge cases discovered throughout the years. I bet that, no matter how good the team rewriting the codebase is, not all the cases that were previously handled will be handled in the new codebase. Old bugs will start to surface again, which will require emergency fixes, which will uncover some wrong assumptions, and the code will start decaying again, with new patches over new patches delivered at a (maybe too) fast pace (because clients aren't going to wait for you to figure out how to fix issues the correct way; you need to fix their issues fast).
I often say that there are nuances, that not everything is good or bad. But in this case, this is an opinion I strongly hold: complete rewrites are completely bad, and there is a better approach when dealing with codebases that are nearly unmaintainable and cause dread for the developers tasked with working on them.
This approach involves a continuous process of improving specific areas, after you discover what the biggest pain points are. For example, it is advisable to identify and start with the areas that see the most code decay and are the most impactful for the clients. These I call "the core areas" of the code: the parts of the application that see intense usage, produce the most bugs, and where the team ends up spending a lot of time trying to add new things or fix existing ones.
Once we identify such an area, we need to prepare it for "fixing". To do that, we need to dedicate some time and resources to test it thoroughly and then to systematically improve it.
It is crucial to write tests at its boundaries: we identify those boundaries (or the desired boundaries) and start writing a lot of integration tests around them. If data structures from other domains leak in, we need to mock them out. It is crucial to have a bunch of good quality tests that pin down the observable behavior of that subsystem before we start touching its code. It will be a painful process, for sure, but once we have these tests, we can start moving the needle in the area we actually care about: reducing code debt.
The tests need to test the boundaries, meaning that we only care about the contracts that subsystem has with the rest of the codebase: for these inputs, expect these outputs. The more cases we put in and edge cases we cover, the better.
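As a sketch of what such a boundary test might look like (all names here are hypothetical, assuming a subsystem that computes invoice totals and depends on a pricing gateway from another domain):

```python
# Hypothetical example: characterization tests at a subsystem boundary.
# The subsystem is treated as a black box: given these inputs, expect
# these outputs. The pricing gateway that leaks in from another domain
# is replaced with an in-memory stub.

class StubPriceGateway:
    """Stands in for the real pricing service at the boundary."""
    def __init__(self, prices):
        self._prices = prices

    def unit_price(self, sku):
        return self._prices[sku]

def invoice_total(items, gateway):
    """The legacy behavior we are pinning down, not changing (yet)."""
    total = 0.0
    for sku, qty in items:
        total += gateway.unit_price(sku) * qty
    # Legacy quirk discovered over the years: orders over 100.0
    # get a 5% discount. The tests must capture this edge case too.
    if total > 100.0:
        total *= 0.95
    return round(total, 2)

# Boundary tests: inputs in, outputs out, internals untouched.
gateway = StubPriceGateway({"A": 10.0, "B": 25.0})
assert invoice_total([("A", 2)], gateway) == 20.0
assert invoice_total([("A", 2), ("B", 4)], gateway) == 114.0
```

Note that the tests assert on the contract only, including the discount quirk; they will keep passing no matter how the internals of `invoice_total` are restructured later.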
Then, we can start refactoring the code, targeting specific code smells as we encounter them: long functions, duplicated logic, deeply nested conditionals and unclear naming are usually the most impactful ones to fix first.
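A minimal illustration of such a refactor (hypothetical code, removing a nested-conditionals smell with guard clauses) shows why the boundary tests matter: the contract stays identical, so they keep passing.

```python
# Hypothetical example: fixing a "nested conditionals" smell with
# guard clauses. Behavior is unchanged, which the earlier boundary
# tests would verify.

def shipping_cost_before(weight_kg, express):
    # Smelly original: deeply nested branching, a single exit point.
    if weight_kg > 0:
        if express:
            cost = 10.0 + weight_kg * 2.0
        else:
            cost = 5.0 + weight_kg * 1.0
    else:
        cost = 0.0
    return cost

def shipping_cost_after(weight_kg, express):
    # Refactored: guard clause first, then flat, readable branches.
    if weight_kg <= 0:
        return 0.0
    if express:
        return 10.0 + weight_kg * 2.0
    return 5.0 + weight_kg * 1.0

# The contract is identical; a quick check over sample inputs:
for w in (0, 1.5, 20):
    for e in (True, False):
        assert shipping_cost_before(w, e) == shipping_cost_after(w, e)
```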
This is a process which needs to be repeated until we are satisfied with the result. It is not necessary to fix the whole system: some technical debt is desirable, because the more time you spend refactoring, the less you gain (the law of diminishing returns). Therefore, some technical debt in the areas that don't see much usage, don't produce many bugs, or that the team rarely needs to touch is not critical to fix at all. As long as the clients are happy, the developers are happy and do interesting things in the areas that have the most business impact, and the stress levels of the team are fine, the business can progress and the product can evolve at a good enough pace.
In my opinion, fixing the technical debt should not be the focus of the tech team. Some tech debt is an acceptable outcome as long as it is kept at a manageable level. If some areas of the codebase become too messy and entangled, we need to tackle them one at a time, in order of business impact: the more bugs pop up in an area, the more time is spent working in it, and the more frustration it causes the developers tasked with anything related to it, the higher the priority of handling that area should be.
It is crucial to handle these messy areas one at a time, and never completely rewrite a whole codebase. The main reasons against a complete rewrite are: developers need to be dedicated to it, which reduces the capacity of the team handling the "old" live version; it is impossible to cover all the exact edge cases that were fixed in the original version along the way; and, assuming the rewrite finally catches up to the original product, the release process would be painful from a business perspective. You either do a big bang release and onboard all the clients once again (which will start a huge wave of bug fixes and customer support requests), or you migrate clients in smaller batches, which could take a long time and leaves you with two competing products advancing in parallel, at different paces.
You need to tackle the problematic, messy code areas by: identifying and prioritizing the worst ones, writing a lot of good behavioral tests so we can iterate on their internals without affecting the overall functionality, tackling the most common code smells, and then isolating internal sub-systems (enforcing a good separation of concerns) and enforcing those boundaries by allowing the subsystems to communicate only through well-defined contracts which are not allowed to leak domain-specific knowledge across boundaries.
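The last step, a contract that doesn't leak domain knowledge, could look like this (a hypothetical sketch; the subsystem names and fields are invented for illustration):

```python
# Hypothetical example: a narrow, well-defined contract between two
# subsystems. Billing never sees the Orders domain's internal objects;
# it only receives a small, explicit data structure.

from dataclasses import dataclass

@dataclass(frozen=True)
class ChargeRequest:
    """The entire contract: nothing else crosses the boundary."""
    customer_id: str
    amount_cents: int
    currency: str

class BillingService:
    def __init__(self):
        self.charged = []

    def charge(self, request: ChargeRequest) -> bool:
        # Billing depends only on the contract, not on how the Orders
        # subsystem models carts, discounts or line items internally.
        self.charged.append(request)
        return request.amount_cents > 0

def submit_order(billing, customer_id, line_items):
    # The Orders side translates its rich internal model into the
    # contract before crossing the boundary.
    amount = sum(price for _, price in line_items)
    return billing.charge(ChargeRequest(customer_id, amount, "USD"))

billing = BillingService()
assert submit_order(billing, "c-42", [("book", 1999), ("pen", 299)]) is True
```

Because only `ChargeRequest` crosses the boundary, either side can be refactored, or even rewritten, without touching the other, which is exactly the property a full rewrite throws away.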