Your code is as good as your data is structured

Photo by UX Indonesia / Unsplash

The conversation about code quality usually revolves around code smells, patterns, modules, languages, principles and the like. These things do affect the quality of the code to some degree, and they are much easier to discuss in isolation, where a quick example can be put together.

But I have another take on code quality: the overall quality of your codebase depends heavily on how your data is structured. If you have a lot of data duplication and regularly have to sync data across multiple places even for simple actions, your codebase is probably messy spaghetti that makes people wonder why they put up with it.

You can showcase a diagram for a design pattern and explain it with a very specific example: how a Car inherits from Vehicle and encapsulates an abstract Engine instance, which can be an ElectricEngine or an InternalCombustionEngine.

I have seen such examples way too often, examples that never appear in real life. Things like cars, students and teachers, employees, etc. But anyway, that's another topic.

Coming back to the data: if your platform stores its data in inefficient formats, over time it will kill your app's performance and your developers' mental health. Bad data leads to bad code, which leads to delays, missed deadlines, frustrated developers and, overall, a bad life.

Identify the access patterns

There is no single right way to structure the data and no magic pattern to save you. In some cases it is desirable to store each resource individually in its own table; sometimes it is better to "embed" some resources inside other resources and work with them as a group.

But depending on the expected usage, you can definitely make some decisions that will prove to be high impact down the road, as the codebase and the team grow:

  • Resources that make sense only when attached to another resource. These cases are rare, but they usually show up when you have more complex configurations and try to persist everything in separate tables. For example, what would have happened to Kubernetes if they had made PodTemplateSpec a resource detached from a Deployment? Every time you needed a Deployment, you would first create a PodTemplateSpec, then create a separate Deployment holding some kind of foreign key to it. Each Deployment would actually be at least two resources, because you can't create one without a PodTemplateSpec, and a PodTemplateSpec doesn't really make sense by itself. So they are created, deleted, modified and fetched together, the PodTemplateSpec being embedded into the Deployment itself. If you are not careful with this, building the whole resource with all its dependencies quickly turns into 7-table joins, which are a nightmare to handle, both for your team and for your database's query planner.
  • Resources that just exist and don't get updated. In some cases there is a central piece of information in our system that, once persisted, rarely changes, maybe once every few months, and each change has a great impact on the codebase. In these cases it is better to store that resource outside the database, in some kind of configuration file or even hardcoded in the source (provided it doesn't contain sensitive data). That way you can easily ship it to production, and even put static layers of validation on top of the code that interacts with it.
  • Resources that accumulate and don't really get updated. In a lot of cases we create resources, a lot of them, and their table becomes a kind of "append only" database. We put data in, the business usually cares only about what is recent, and the old stuff is kept "just in case". Here, adding a creation timestamp and an index on it makes life much easier down the road, because with these ever-growing tables, speed quickly becomes a problem.
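To make the first point concrete, here is a minimal Python sketch of a Deployment that embeds its PodTemplateSpec, so the pair is always created, stored and fetched as a single unit. The field names are simplified and hypothetical, not the real Kubernetes schema:

```python
import json
from dataclasses import asdict, dataclass


# Hypothetical, heavily simplified shapes inspired by the Kubernetes example.
@dataclass
class PodTemplateSpec:
    image: str
    replicas: int


@dataclass
class Deployment:
    name: str
    template: PodTemplateSpec  # embedded value, not a foreign key


deploy = Deployment(name="web", template=PodTemplateSpec(image="nginx:1.27", replicas=3))

# The whole resource serializes (and would be stored, fetched and
# deleted) as one unit -- no joins, no orphaned templates.
doc = json.dumps(asdict(deploy))
print(doc)
```

Because the template lives inside the deployment, there is no way to end up with a PodTemplateSpec floating around on its own, and reading the full resource is a single lookup instead of a join.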
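And for the append-only case, a sketch of an events table with an indexed creation timestamp, using an in-memory SQLite database as a stand-in for whatever you actually run; the table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An "append only" table: rows are inserted, almost never updated.
conn.execute("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        created_at TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")
# The index on created_at keeps "what happened recently" fast
# even as the table grows without bound.
conn.execute("CREATE INDEX idx_events_created_at ON events (created_at)")

conn.execute("INSERT INTO events (payload) VALUES ('signup')")
recent = conn.execute(
    "SELECT payload FROM events WHERE created_at >= datetime('now', '-1 day')"
).fetchall()
print(recent)  # [('signup',)]
```

The business queries "only what is recent", so the range scan hits the index instead of walking the whole table.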

Don't always deduplicate

As software developers, we are taught from a very young age that we must avoid duplicating code at all costs. We transfer these teachings to the data realm, where we start thinking about how to deduplicate data to save those kilobytes of storage. The business is relying on us to do that; better to spend those engineering hours saving 0.00003 cents' worth of storage (not).

I have seen a lot of cases where some data is duplicated, then extracted into a separate table, and all the resources that need it gain foreign keys to access it. I saw it most frequently with cities, countries, etc. For some developers, having a centralized table of cities/countries is crucial to the business' success, when in fact, unless your platform does very specific localized things like Airbnb, Google Maps and the likes, you don't need it, and clients don't really care that when they type in "Iași, Romania", a foreign key gets stored instead of the raw string. Sure, it might have its advantages for statistics and analytics, but more often than not, an index on the column is far more straightforward.
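As a sketch of that simpler alternative, here is the raw-string-plus-index approach, again with SQLite standing in for your database and made-up table and column names. Statistics still work with a plain GROUP BY, no cities table and no join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The city is stored as the raw string the client typed in,
# with an index instead of a foreign key to a cities table.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_users_city ON users (city)")

conn.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [("Ana", "Iași, Romania"), ("Dan", "Iași, Romania"), ("Eva", "Cluj, Romania")],
)

# Analytics-style query: users per city, straight off the indexed column.
rows = conn.execute(
    "SELECT city, COUNT(*) FROM users GROUP BY city ORDER BY COUNT(*) DESC"
).fetchall()
print(rows)  # [('Iași, Romania', 2), ('Cluj, Romania', 1)]
```

If the platform ever does grow into localized features, extracting a proper table later is a mechanical migration; starting with one is complexity paid up front for nothing.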

Be demanding with your data

When we start to accept incomplete, ambiguous or plain wrong data under the lie that "we will handle it later", it is a recipe for disaster. Wrong data only accumulates. And if your business is accelerating and onboarding more and more clients, the bad data piles up faster, the "cleanup" keeps getting pushed back, and you start working around it.

Your code becomes 80% checks that the data is correct, plus reconciliation algorithms very specific to the current business objective, and only 20% actual business logic. Then different parts of the platform reason about the missing data differently, and you have a perfect recipe for developer, stakeholder and client frustration. When one team decides that a missing product name means those products get ignored, and another decides to include them under the name "missing", causing conflicting things to happen, it is already too late.

Another big issue: if you accept wrong data and rely on reconciliation algorithms for the platform to function correctly, you are going to get wrong data from unplanned sources too. We all know programmers aren't perfect (hard truth, I know), and bugs slip through. When a bug in the data-ingestion path goes undetected for a while, wrong data accumulates from that source with a wrong-to-correct ratio of 1 to 0. Some bugs can really mess up your data, and when you figure it out, you have some bad days ahead of you: do you fix it? How do you fix it? How broken is it? In the happy case where the data isn't completely messed up, what is the plan to fix it retroactively? Does the programmer who introduced the bug need to update their CV? Or the tester who didn't catch it? Or the manager who pushed the team to deliver faster because he underestimated the complexity and gave an unrealistic deadline to the stakeholders?

You don't want to get there. A good practice is to implement validation exactly at the place where data enters the system, so invalid data gets rejected on the spot. It is better for something to crash and tell you "Hey, something is wrong, maybe you should fix it" than for the problem to go undetected for days or even weeks and force your codebase to work around that wrong data.
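A minimal sketch of that boundary validation, assuming a hypothetical Product record with a required name and a non-negative price; the field names are made up for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    name: str
    price_cents: int

    def __post_init__(self):
        # Validate exactly where data enters the system: refuse to
        # construct an invalid record instead of "handling it later".
        if not self.name.strip():
            raise ValueError("product name is required")
        if self.price_cents < 0:
            raise ValueError("price must be non-negative")


Product(name="keyboard", price_cents=4999)  # fine

try:
    Product(name="   ", price_cents=100)  # rejected at the boundary
except ValueError as exc:
    print(f"rejected: {exc}")
```

The crash happens at ingestion time, with a message pointing at the exact bad field, instead of a silent "missing" product surfacing weeks later in three teams' reports with three different meanings.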

It is unpleasant, trust me...