In this post, I want to talk a little about GraphQL, the newest (sort of, not really) and hottest tool that supposedly helps you build better HTTP APIs, and how it fits into the bigger scheme of things.
I will talk about a question that technical decision makers have to answer more and more often when choosing technologies and architectures for greenfield projects: GraphQL or REST? I will take a strategic point of view and try to point out the advantages and drawbacks of both solutions, and what we need to take into consideration when we pick GraphQL over REST, REST over GraphQL, or both (or neither?).
I am going to say this straight out of the gate: both are good, one is not better than the other. Each one has its drawbacks and its advantages, and specific use cases.
Firstly, to settle on a common vocabulary, I will talk about what they are, because out there in the industry, there are various misconceptions about what GraphQL and REST mean.
They are not technologies. They are sets of guidelines/specifications that help us structure our HTTP APIs in a way that makes sense and helps our clients consume these APIs. By adhering to such a guideline, we enable our clients to use specialized client libraries that work well with our servers, speeding up their development time, minimizing maintenance and allowing them to use battle-tested open source clients so they won't have to reinvent the wheel.
REST stands for REpresentational State Transfer and comes with a set of guidelines for structuring and designing your API so that you get a predictable, extensible and functional API.
In the case of RESTful APIs, it is very important to use the HTTP specification to its fullest:
These three things are the pillars of building a RESTful API. Things such as the way we serialize the data, whether it is JSON or XML, don't matter. Ideally, clients would specify the format of the data they would like to receive in an Accept request header, and they would be able to get the data in various formats.
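To make the content negotiation idea concrete, here is a minimal sketch of a server-side helper that picks a serialization format based on the `Accept` header. The `render` function and the `<resource>` XML wrapper are made up for illustration, and the parsing is deliberately simplified (no q-values, no wildcards other than `*/*`).

```python
import json
from xml.etree import ElementTree as ET


def render(resource, accept="application/json"):
    """Return (content_type, body) for the first supported type in Accept.

    Simplification (assumption): q-values and partial wildcards are ignored.
    """
    for item in accept.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in ("application/json", "*/*"):
            return "application/json", json.dumps(resource)
        if media_type == "application/xml":
            # Hypothetical flat XML layout: one child element per key.
            root = ET.Element("resource")
            for key, value in resource.items():
                ET.SubElement(root, key).text = str(value)
            return "application/xml", ET.tostring(root, encoding="unicode")
    # No supported media type: a real server would answer 406 Not Acceptable.
    raise ValueError("406 Not Acceptable")


vm = {"id": 100, "name": "server-apache-2"}
print(render(vm, "application/xml")[1])
# <resource><id>100</id><name>server-apache-2</name></resource>
```

The same resource can be served in either format without touching the URL or the HTTP method, which is the point of keeping serialization out of the API design.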
For example, let's say we are developing a web app that controls virtual machine deployment in a datacenter. Exposing functionality about virtual machines would look like this:
There are some status codes that can be returned by all endpoints, such as
Having to create this set of URLs for each resource creates a lot of clutter and makes the API harder to develop and maintain, because each endpoint has to be developed individually (document the URL, the accepted query parameters, the accepted payload if any, and then all the possible response types, from successful results to all the errors that can occur).
Having a RESTful API that manages a few resources shouldn't be too much to handle for a small team. But in the wild, with the complexity of today's web apps, the number of resources is pretty big.
Usually, resources have connections between each other. A virtual machine has some storage blocks and some network interfaces attached, a cluster has some physical nodes, these nodes have some virtual machines, virtual machines are owned by users or teams, teams have some users, users have some permission policies attached, etc. That's what real-life applications look like. It is very rare that you have resources that are entirely independent.
When resources are linked, we can say they are nested from a data retrieval perspective. For example, when we want to query a virtual machine, we don't really want to get ALL the data attached to it, because we don't really know what data the client needs. Let's take the VM, storage blocks and network interfaces example.
Sometimes, clients will need all the information to display it, sometimes they don't. So we don't really know what to return. We are left with three choices:
The first one is to include all the information for a nested resource in the parent resource:
GET /vms/100/
{
  "id": 100,
  "name": "server-apache-2",
  "specs": {"ram": "1Gi", "cpu": 1},
  "storage_blocks": [
    {"id": 1004, "name": "apache-main-disk", "capacity": "100Gi", "used": 0.67},
    {"id": 2003, "name": "apache-secondary-disk", "capacity": "25Gi", "used": 0.22},
    {"id": 4503, "name": "apache-third-disk", "capacity": "10Gi", "used": 0.11}
  ],
  "network_interfaces": [
    {"id": 11754, "name": "vpc-1", "attached_to_vpc": 15324},
    {"id": 55432, "name": "vpc-2", "attached_to_vpc": 65353},
    {"id": 24341, "name": "vpc-3", "attached_to_vpc": 98743}
  ],
  ...
}
But then we run into another problem: nested resources might have other nested resources as well (e.g. network interfaces are linked to a VPC; should we include that data as well?). If we just include all the nested resources whenever we can, we will end up dumping the whole database with each request. That's not really a good long-term strategy.
So we have another choice: include references to the linked resources, and make the client request the data of the linked resources by their ID, from their own endpoints. The request from above becomes:
GET /vms/100/
{
  "id": 100,
  "name": "server-apache-2",
  "specs": {"ram": "1Gi", "cpu": 1},
  "storage_blocks": [ 1004, 2003, 4503 ], // or with hyperlinks: [ "/storage_blocks/1004/", "/storage_blocks/2003/", "/storage_blocks/4503/" ]
  "network_interfaces": [ 11754, 55432, 24341 ],
  ...
}
This way, we only include the IDs (or the hyperlinks) of the nested resources, so that the user can retrieve them individually if they are really interested to see more detailed information about each nested resource.
This approach results in a lot of requests to get the complete data: to get the same data as in the first approach, 7 requests need to be made (one for the VM, 3 for the network interfaces, 3 for the storage blocks). It is very similar to the infamous N+1 queries problem, but in the HTTP API space.
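A small simulation shows where the 7 requests come from. The sketch below uses an in-memory dictionary as a stand-in for the HTTP API, so nothing is actually sent over the network. `CountingClient` and `fetch_vm_with_nested` are names I made up for the example, and the `/network_interfaces/<id>/` path is an assumption that mirrors the `/storage_blocks/<id>/` hyperlinks from above.

```python
# Hypothetical in-memory stand-in for the API; keys are URL paths.
FAKE_API = {
    "/vms/100/": {
        "id": 100,
        "storage_blocks": [1004, 2003, 4503],
        "network_interfaces": [11754, 55432, 24341],
    },
}
for block_id in (1004, 2003, 4503):
    FAKE_API[f"/storage_blocks/{block_id}/"] = {"id": block_id}
for nic_id in (11754, 55432, 24341):
    FAKE_API[f"/network_interfaces/{nic_id}/"] = {"id": nic_id}


class CountingClient:
    """Pretends to issue GET requests and counts how many were made."""

    def __init__(self, api):
        self.api = api
        self.request_count = 0

    def get(self, path):
        self.request_count += 1
        return self.api[path]


def fetch_vm_with_nested(client, vm_id):
    """Fetch a VM, then one request per referenced nested resource."""
    vm = dict(client.get(f"/vms/{vm_id}/"))
    vm["storage_blocks"] = [
        client.get(f"/storage_blocks/{i}/") for i in vm["storage_blocks"]
    ]
    vm["network_interfaces"] = [
        client.get(f"/network_interfaces/{i}/") for i in vm["network_interfaces"]
    ]
    return vm


client = CountingClient(FAKE_API)
fetch_vm_with_nested(client, 100)
print(client.request_count)  # 7
```

The count grows linearly with the number of references, and it multiplies if the nested resources carry references of their own.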
The third approach is to expose the nested resources as sub-collections of the main resource. The response to GET /vms/100/ will not include the storage_blocks and network_interfaces keys. Instead, we will have two extra endpoints that return the list of the resources nested in the VM: GET /vms/100/storage_blocks/ and GET /vms/100/network_interfaces/.
This way, we can also model the CRUD operations on nested resources (e.g. attaching a new storage block will be done with POST /vms/100/storage_blocks/, removing a storage block will be done with DELETE /vms/100/storage_blocks/4433/), problems we didn't even think about when dealing with the first two approaches.
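As a rough sketch, the nested routes can be written down as a small table that maps an HTTP method and a URL pattern to an action. The action names and the `resolve` helper are hypothetical; a real web framework would give you its own routing layer.

```python
import re

# Hypothetical route table: (HTTP method, path pattern, action name).
ROUTES = [
    ("GET", r"^/vms/(?P<vm_id>\d+)/storage_blocks/$", "list_storage_blocks"),
    ("POST", r"^/vms/(?P<vm_id>\d+)/storage_blocks/$", "attach_storage_block"),
    (
        "DELETE",
        r"^/vms/(?P<vm_id>\d+)/storage_blocks/(?P<block_id>\d+)/$",
        "detach_storage_block",
    ),
]


def resolve(method, path):
    """Return (action, url_params) for the first matching route, else None."""
    for route_method, pattern, action in ROUTES:
        match = re.match(pattern, path)
        if method == route_method and match:
            params = {name: int(value) for name, value in match.groupdict().items()}
            return action, params
    return None


print(resolve("DELETE", "/vms/100/storage_blocks/4433/"))
# ('detach_storage_block', {'vm_id': 100, 'block_id': 4433})
```

Every nested collection needs its own set of entries like these, which is exactly the per-endpoint overhead discussed earlier.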
So, how do we deal with nested resources? It depends (of course) on the access patterns, that is, on how often resources need to be retrieved together. If a lot of clients need the nested resources every time they query the main resource, it makes sense to include the full nested resource data in the main resource response. There's no perfect solution, and each of the three approaches fits specific data access patterns and data types. It's up to you or the application architect to figure that out.
Enough negativity, REST has some good parts too. The biggest advantage by far is familiarity: every developer can figure out a RESTful API without significant effort. Endpoints are separate, if you need to get resources, you issue a GET request, and the code that interacts with such an API is self-documenting to some degree (hmm, I wonder what requests.get("https://api.example.com/pending_payments?last=10") does... does it retrieve the last 10 pending payments? It's entirely possible...).
There are a lot of tools out there that can automatically document RESTful APIs, and when developers have to integrate with a third-party service, they expect to find a RESTful API ready to be used. They don't have to learn new tools: raw HTTP requests and JSON parsing get the job done.
Enough talk about RESTful, it's time to talk about its younger sibling: GraphQL.
What is it anyway? GraphQL is a specification for designing web APIs. It was created by Facebook, and its main focus is to solve the biggest pain point of RESTful APIs: nested resources.
GraphQL has these characteristics:
We typically need to retrieve more resources at once, based on their relationships, based on what our intention is, etc. The server can't predict what the client needs: each client has very different requirements, and it's impossible to design specific endpoints that return just the right data for each client. It would require a tremendous development effort which surely isn't worth it.
So, in GraphQL, to get just the data we need, we would be able to do a query similar to
query MyQuery {
  vm(id: 100) {
    id
    name
    specs {
      ram
      cpu
    }
    networkInterfaces {
      id
      name
    }
    storageBlocks {
      id
      name
      capacity
      used
    }
  }
}
or to get just the storage blocks capacities
query MyQuery {
  vm(id: 100) {
    id
    storageBlocks {
      capacity
    }
  }
}
and any other combination of fields and nested entities.
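On the wire, a query like this is usually sent as a plain HTTP POST with a JSON body that carries the query text and its variables. The sketch below only builds the request and does not send it, because the URL is made up. The `$id: Int!` variable declaration is also an assumption, since the schema behind the examples above is not shown.

```python
import json
import urllib.request

GRAPHQL_URL = "https://api.example.com/graphql"  # hypothetical endpoint

# Same shape as the storage block query above, with the ID as a variable.
# The Int! type is an assumption about the (unshown) schema.
STORAGE_QUERY = """
query MyQuery($id: Int!) {
  vm(id: $id) {
    id
    storageBlocks {
      capacity
    }
  }
}
"""


def build_request(query, variables):
    """Wrap a GraphQL query and its variables in an HTTP POST request."""
    payload = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(STORAGE_QUERY, {"id": 100})
# urllib.request.urlopen(request) would send it; omitted because the URL is fake.
print(request.get_method(), request.full_url)
```

Changing which fields come back means editing the query string, not calling a different URL.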
What a GraphQL endpoint is: an endpoint that serves data under a specific, static, strongly typed schema, so that the client can fetch that schema (which is also self-documenting) and then craft their queries based on their specific needs. The server then parses the query and retrieves just the data that was requested.
What it did, in fact, was shift the responsibility of determining what data to return from the server to the client. The server just says "here is what data I have available, with these fields, parameters and all these connections", and then each client, based on that specification, uses the query language to tell the server exactly what data it needs.
Another advantage of GraphQL is that the client gets only the data they need. When using REST, if the client needs only one specific field from a specific resource, there isn't a way to get just that. They have to retrieve the full resource and ignore the rest of the unwanted data. There are some weird implementations that can work around that (e.g. adding a query parameter ?fields=name,capacity.ram to each request), but that can be very awkward to implement and maintain, especially when you deal with nested resources.
With GraphQL you only get exactly what you requested, thus avoiding under-fetching (having to do multiple requests to get all the data you need) and over-fetching (getting more data than you actually need, because there's no way to easily get only a subset of the fields).
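To show why the `?fields=` workaround gets awkward, here is a minimal sketch of what the server would have to do with such a parameter. The `project` helper is hypothetical, and it only handles nested objects: lists of nested resources (like storage blocks) would need extra logic, which is where the maintenance burden starts to show.

```python
def project(resource, fields):
    """Keep only the dotted paths listed in a ?fields= value.

    Assumption: every path segment is a dictionary key. Lists are not
    supported in this sketch.
    """
    result = {}
    for path in fields.split(","):
        keys = path.strip().split(".")
        source, target = resource, result
        # Walk down to the parent of the requested leaf, mirroring the
        # structure in the output as we go.
        for key in keys[:-1]:
            source = source[key]
            target = target.setdefault(key, {})
        target[keys[-1]] = source[keys[-1]]
    return result


vm = {"id": 100, "name": "server-apache-2", "specs": {"ram": "1Gi", "cpu": 1}}
print(project(vm, "name,specs.ram"))
# {'name': 'server-apache-2', 'specs': {'ram': '1Gi'}}
```

A GraphQL server does this kind of selection as part of the query language, instead of every endpoint bolting on its own convention.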
Because it is a fairly new specification with a lot of features, not all developers are familiar with it. Integrating with a GraphQL API is more cumbersome and requires more development effort.
And that's not because of the implementation complexity (in that respect it's all just an HTTP request after all), but because of the GraphQL language itself: developers need to learn new concepts, learn to use them and craft their queries carefully, which is a whole thing in itself.
Because of the flexibility GraphQL allows in data fetching (basically allowing any combination of retrieved fields, with no restrictions on how deeply nested your queries can be), the backend has to support all that.
The biggest clash between the backend implementation and data fetching flexibility comes when querying nested resources, and these resources are stored in a relational database. Usually, this kind of relational data is retrieved from the database using joins, to avoid the N+1 queries problem, but when developing a GraphQL server, that's a little harder than usual to do.
This is due to the fact that you can't really know beforehand how the data will be fetched over the lifetime of the application, so you kind of have to prepare for all the cases. In theory you can craft special queries for special cases but that takes time and effort.
For example, you have three resources A, B and C and a relationship A -> B -> C. In the GraphQL query, it would look something like this:
query {
  a {
    b {
      c {
        someFieldOnC
      }
    }
  }
}
The server can resolve that data through a three table join (between the tables for A, B and C), but when the query changes, a new case appears.
query {
  a {
    b {
      someFieldOnB
    }
  }
}
Now we don't need C anymore, so the three table join is not needed anymore. We can get the data we need in a single two table join. We could in theory handle these two cases independently on the server, but that's not a feasible long-term solution. As I was pointing out before, real-life applications are more complex, have a lot of resources that are even more inter-connected, and covering all the possible access patterns is simply not feasible.
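One way to picture the problem is to derive the joins from the shape of the incoming query instead of hard-coding each case. The sketch below represents a query as a hand-built nested dictionary, and the table names are made up. Real GraphQL libraries expose the selection set through their own APIs, so treat this purely as an illustration.

```python
# Hypothetical mapping from relationship fields to the tables behind them.
RELATION_TABLES = {"a": "table_a", "b": "table_b", "c": "table_c"}


def tables_needed(selection):
    """Walk a nested selection and collect the tables to join, in order."""
    tables = []
    for field, sub_selection in selection.items():
        if field in RELATION_TABLES:
            tables.append(RELATION_TABLES[field])
            tables.extend(tables_needed(sub_selection))
    return tables


# The two queries from above, written as nested dictionaries.
deep_query = {"a": {"b": {"c": {"someFieldOnC": {}}}}}
shallow_query = {"a": {"b": {"someFieldOnB": {}}}}

print(tables_needed(deep_query))     # ['table_a', 'table_b', 'table_c']
print(tables_needed(shallow_query))  # ['table_a', 'table_b']
```

Even this toy version hints at the real cost: the join logic now lives in generic code that has to understand every relationship in the schema.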
There are some solutions for this, such as data loaders, but they add complexity again and introduce more advanced programming patterns into our code (e.g. promises and asynchronous programming). With more advanced patterns, the development costs increase, new people will need more time to digest whatever is happening in your code base, and juniors will get overwhelmed by all this complexity (you can't reasonably expect junior programmers to be comfortable with asynchronous programming).
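To give an idea of what the data loader pattern involves, here is a stripped-down sketch using asyncio. It collects the keys requested during one pass of the event loop and fetches them in a single batch. `BatchLoader` and `fetch_blocks` are my own illustrative names, not the API of any particular library, and error handling and caching are left out.

```python
import asyncio


class BatchLoader:
    """Collects keys requested in the same event loop pass into one batch."""

    def __init__(self, batch_fn):
        self._batch_fn = batch_fn  # async function: list of keys -> list of values
        self._pending = {}         # key -> Future awaiting its value
        self._task = None          # in-flight dispatch task, if any

    def load(self, key):
        loop = asyncio.get_running_loop()
        if key not in self._pending:
            self._pending[key] = loop.create_future()
        if self._task is None:
            # The task only starts once the current synchronous code yields,
            # so every load() call made before that joins the same batch.
            self._task = loop.create_task(self._dispatch())
        return self._pending[key]

    async def _dispatch(self):
        pending, self._pending, self._task = self._pending, {}, None
        keys = list(pending)
        values = await self._batch_fn(keys)
        for key, value in zip(keys, values):
            pending[key].set_result(value)


batches = []


async def fetch_blocks(ids):
    """Stand-in for a single 'WHERE id IN (...)' database query."""
    batches.append(list(ids))
    return [{"id": block_id} for block_id in ids]


async def main():
    loader = BatchLoader(fetch_blocks)
    return await asyncio.gather(*(loader.load(i) for i in (1004, 2003, 4503)))


print(asyncio.run(main()))
print(batches)  # [[1004, 2003, 4503]]
```

Three separate `load` calls end up as one batched fetch, but following why that works requires understanding futures, tasks and event loop scheduling, which is the extra complexity in question.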
Both REST and GraphQL have their good and bad parts. When deciding on what to use, there is no golden recipe for choosing the right one (hint: there is no right one), and we need to choose one based on the limited knowledge we have at the moment. To reduce the chances of a failed project, we should at least ask the following questions:
Asking and answering these questions should give us a better idea of what we can get away with and how we should use our resources. A lot of companies opt for both APIs at the same time: a private GraphQL API for internal use and a more restricted/limited public REST API for external client integrations. That's a valid strategy too.