A code smell is a programming pattern that affect the long-term maintenance of some project. It is a bad design decision took at some point in time, that affect the way other people will interact with the code later on.
Code smells aren't necessary a single decision. It's not only one person's contribution. It is the whole team. Most frequent problems created by code smells are in the long term, and are created by compounding problems on top of each other. There is one bad decision at the beginning, either due to the pressure from management to ship quick, or other constraints, but the rest of the team, when facing that poor decision, instead of refactoring it and fixing it when the problem is still small, they build on top of it. And the problem compounds over time until the effort to fix it is too much to justify the investment anymore.
I'll talk about the four most frequent code smells I encountered, and some powerful methods to refactor their code into something that is more maintainable and reusable.
This one is my top favorite and I see it all the time in Python codebases.
How it appears
You have a simple function that needs to return a pair of things. Let's say some user IDs and their emails, for a later
processing. A fast solution is to just return a dict
with the user ID as the key and the email as the value.
Then we can process them, because we have all the information we need at the time.
Then a new requirement comes. The email is not enough anymore. Now the user has multiple emails, one primary and can have zero or more secondary ones, and we need all of them, and also knowing which one is the primary one.
We can extend the function to now return a more complex object as the value, which will include all the necessary information.
Now we have complex objects with serialized user information leaking in our code. Other colleagues need to extend the same area and start building on top of them.
After a while, we end up with a huge user representation in a dict
, and a bunch of small functions that know how
to work with those objects (eg. get_primary_email(user_dict)
).
Why is it a problem
How to fix it
Introduce objects early on. Even though it is only being used in one place, contrary to popular belief, single use classes are fine. They enable easy extension over the long time and encourages grouping the related functionality on it (through methods) so it is more easily discoverable by the team.
An extra advantage is that classes define a clear contract that is being used by higher level components. So during testing (or other similar parts of the code), as long as we implement some classes that obey that contract, we can use them as a drop-in replacement for the original class, and customize the behavior of certain sub-systems even more.
If a function is too long, it is clear that it does too much. They usually have multiple segments that can be easily extracted and isolated.
My personal rule of thumb is that, if I need to scroll to be able to see the whole function, it's too large. That limits the size of my functions to about 40 lines, which is more than enough to do the things I need to do. It also serves as a hint that it's time to extract new methods if I reach that size.
How it appears
This kind of mess can appear in two ways: either a solo developer, due to the lack of experience or laziness creates a huge functions that just works, and then commits it. And then somehow it gets past the review (so it's actually a team effort), and reaches the production codebase. Then, being a big chunk of code, other teammates will be afraid to change it, especially if it works correctly and isn't properly tested.
The other way for this code smell to appear is by many incremental changes. When the method was first created, it was small, clear and concise. But then, as more requirements appeared, the team just added more and more, added more branches, exceptions, conditions, etc. and the function grew out of control. By the time somebody had some time and will to refactor it, it became very big and very entangled. This way is more frequent that the first one, beecuse it is harder to see in code reviews. All the reviewers see are a few lines added each merge request.
Why is it a problem
Big functions are hard to manage, hard to debug due to the big number of things happening at once. Big functions are big because more than often they do more than just one thing and don't delegate enough. Due to this, they are harder to test properly (complexity is off the charts due to the huge number of possible execution branches).
If it is hard to debug and figure out what it does, it is more than likely a big time sink for the team. Whoever has to debug something inside it, will lose more time than they should and slows down their progress. Over time, it becomes "that functions" that everybody is afraid of. It's sole existence decreases the team morale.
It's true, I have seen functions of hundreds of lines of code. Debugging felt like testing an end to end product by myself to figure out all the behaviors of the function.
How to fix it
If a function as too many parameters, our total possible cases are the total combinations of all the parameters. As we add new parameters, the total number of use cases raises exponentially. That's why, 4-5 parameters should be the maximum for any function.
Having too many parameters is also an indication that the function is doing too much. We need to investigate whether that's the case and we should fix that code smell before.
How it appears
Just like the 2nd reason for big functions, it appears over time. Adding new parameters one by one is a common cause for this, which is also due to the fact that the slowly the function grows into doing too much.
But even if the function does just enough, we can still end up with too many parameters.
Why is it a problem
With too many parameters, writing tests for it is harder. Ideally we would have to go through all possible parameter combination to ensure the function is completely tested.
Also, it greatly reduces readability of the code that uses that function. It forces other parts of the codebase to handle too much data at once, just to be able to call this function.
How to fix
This is a problem more specific to Python, because I've seen the multiple returns (which in fact it's just a return of a tuple) abused so many times.
How it appears
There is a function that does some computation and returns a single result. Then the developer discovers that they need some more internal information from that function as well. So they add another return type. It's easy, it's fast, it does its job.
Then another developer works on the same area, they need another extra piece of information returned from that function. It already returns two things, and adding a 3rd one is easy, fast and it does it's job.
Then another, and another, etc.
Why is it a problem
Our function ends up returning 6 things. When returning many things and becomes widely used in our code, adding new returns becomes more frustrating because now we need to add a new variable to all other code that uses this function (thus leaking data to that code it doesn't need).
These functions are very error-prone to a special kind of errors: the order of the return variables and the order of the variables that receive that data needs to be the same. You'll say that it's not a problem, but when managing 7 variables at once, it's easy to swap two if you write it manually. If you copy-paste the return in your function, then it's another problem.
Yo can see this kind of code smell enables all kind of bad programming practices: copy-paste-ing code, subtle bugs, information leak in components that don't need it.
How to fix it
I think most bad code and code smells come from the bad assumption that we should extract in functions and classes the code that we want to reuse. If we don't plan to reuse it, don't extract it. But that is false, because we should teach other programmers that functions and classes do not always need to be reused. They serve a higher purpose, that of organizing our code better, structure it better, creating contracts between different levels of the code, and allow for better testability, readability and discoverability. I find these three things to be the pillars of a good code base: