The 5 problems your GraphQL server needs to solve before it can go live in production

Jacob Voytko
11 min read · Jan 25, 2021

It has never been easier to stand up a GraphQL server. But ensuring your server is secure is another thing entirely. GraphQL is a flexible technology. This flexibility is beneficial to architects designing a new GraphQL API and to frontend engineers building new experiences. Attackers also love this flexibility. It gives them new avenues for finding data incorrectly protected by authorization. It gives them the ability to scrape your entire site from a self-introspecting endpoint. They can even write massive queries to take down a server entirely.

Below, we will walk through the 5 problems that your GraphQL server needs to handle before it goes live in production.

Problem 1: Inconsistent authentication

Real world example: HackerOne leaked HackerOne user data

Your data model might end up with fields with mixed authorization values. Even if you don’t start this way, it can evolve this way over time. For example, you might have a few record types in an eCommerce setup:

  • An Item that can be purchased
  • A User that can be a buyer or a seller
  • A Transaction type linking a buyer, a seller, a payment, and an item
  • A Payment type indicating how the buyer purchased it
  • A Disbursement type indicating how the seller will get paid

Your GraphQL schema modeling for a Transaction might look like this:

type Transaction {
  # Fields visible to both parties
  buyer: User!
  seller: User!
  item: Item!
  # Field visible to just the buyer
  payment: Payment!
  # Field visible to just the seller
  disbursement: Disbursement!
}

We have three authorization models in this type: a few fields that both parties can access, one field only the buyer may see, and one field only the seller may see. The payment information is sensitive. The seller should never see it, because it is private to the buyer. But the buyer might want a reminder of which credit card they used to purchase the item.

Similarly, the transaction might link to data about how the seller will receive the money. The buyer cannot see information about this, because this might reveal sensitive financial information about the seller.

This is just one object and two users. You can imagine having hundreds or thousands of mixed-authorization fields in a large schema. How do you get this right?

How to implement granular authorization: Defensive programming

The first step is to program defensively. GraphQL resolvers are passed a “context” object that is global to the request resolution. You can take advantage of this to make authentication and authorization separate layers that work together. In the authentication layer, you verify that there is a valid user on the request and hydrate a User object representing them on the context.
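As a sketch of that authentication layer, hydrating the user might look like the following. The verifyToken and loadUser helpers are hypothetical stand-ins for your auth and data layers:

async function buildContext({ req }) {
  // Authentication layer: verify the request and hydrate a User
  const token = (req.headers.authorization || '').replace('Bearer ', '')
  const claims = await verifyToken(token) // hypothetical; throws if invalid
  const user = await loadUser(claims.user_id) // hypothetical data access
  return { user } // resolvers receive this object as their context
}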

When loading related data, use this User on the context to your advantage. If you want to load a Payment and you have the authenticated user and a transaction ID, you can fail the request when the payment belongs to someone else:

if (transaction.payment.buyer_user_id !== context.user.user_id) {
  throw Error('This user cannot access a Payment owned by another user')
}

Centralize this logic to make it easier to use correctly. For example, you could make a PaymentView class wrapping a Payment and a User that performs checking for you. You could also add a method getPayment(payment_id) on the context class that automatically injects the User from the request, to make it harder to mess up.
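Here is a minimal sketch of that getPayment helper, assuming a hypothetical loadPayment data-access function; the point is that the authorization check lives in exactly one place:

class RequestContext {
  constructor(user) {
    this.user = user // hydrated by the authentication layer
  }

  async getPayment(paymentId) {
    const payment = await loadPayment(paymentId) // hypothetical data access
    // Centralized check: every resolver that loads a Payment gets it for free
    if (payment.buyer_user_id !== this.user.user_id) {
      throw Error('This user cannot access a Payment owned by another user')
    }
    return payment
  }
}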

If you want a full code example, the “References” section below has more information.

How to implement granular authorization: Directives

Some GraphQL systems allow you to define extra directives and their behavior. Instead of relying on manually written logic to check against a context, you define on the types themselves that checking needs to happen.

type Transaction {
  # Fields visible to both parties
  buyer: User!
  seller: User!
  item: Item!
  # Field visible to just the buyer
  payment: Payment! @privateTo(field: "buyer")
  # Field visible to just the seller
  disbursement: Disbursement! @privateTo(field: "seller")
}

This would be paired with a definition for @privateTo that verifies the authenticated user matches the user in the provided field.
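The wiring for custom directives varies by framework, but the check @privateTo performs boils down to a resolver wrapper like this sketch (the field names mirror the schema above):

function privateTo(fieldName, resolve) {
  return (parent, args, context, info) => {
    const owner = parent[fieldName] // e.g. the "buyer" User on a Transaction
    if (!context.user || context.user.user_id !== owner.user_id) {
      throw Error('This field is private to the "' + fieldName + '"')
    }
    return resolve(parent, args, context, info)
  }
}

// Usage: wrap the payment resolver so only the buyer can read it
// Transaction: { payment: privateTo('buyer', paymentResolver) }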

Some care needs to be taken with this approach. With directives, the default is that all of your data leaks, and you need to correctly plug every leak as you go. The Context-based solution from the previous section, designed correctly, fails the other way: by default nothing is exposed, and you have to explicitly release the data you want visible.

Check out the “GraphQL directive permissions in Prisma” in the references below for a full code example.

You can also check out a similar approach in Hasura, where the implementation can be done from a GUI rather than through explicit directives. Check out “Hasura docs: Authentication & Authorization” below.

References

  • GraphQL directive permissions in Prisma
  • Hasura docs: Authentication & Authorization

Problem 2: Denial of service attacks

Real world attack example: Large GraphQL query DOSes server

In a denial of service (DOS) attack, attackers try to disrupt your service by making it struggle under load. For example, if a social network allowed you to query every single friend of a particular user via GraphQL, an attacker could take the service down with a query like this:

query {
  user(name: "Kevin Bacon") {
    friends {
      friends {
        friends {
          friends {
            friends {
              friends {
                name
              }
            }
          }
        }
      }
    }
  }
}

Roughly translated: “Hey, for some user, get me all of that user’s friends. For each of those user’s friends, get me all of their friends. For each of those people, get me all of their friends. And so on and so forth.”

To outline how much data this can return: the query above is an approximation of the "six degrees of Kevin Bacon" game. That is, it would be expected to return every single user on the service, many times over. Even if you had perfect caching, it could still read billions of records and return quadrillions of objects (because lots of people know each other, they'll each be returned hundreds of times). If your server is doing that, it's having a bad day.


How to protect against denial of service attacks

First, your service should have good HTTP hygiene: terminate requests that take too long, reject payloads that are too large, and so on.

Second, you should enable some form of query complexity analysis. There are many implementations, but the idea is the same: you assign a "score" to each field. Maybe objects are 10 points each, scalar fields are 1 point each, and your search endpoint is 100 points. You then estimate how much is being requested; fields that return lists multiply the cost by the number of objects requested. If the average user has around 200 friends, we might calculate the complexity of the attack query above as 1 * 10 * 200^6 + 1 * 10 = 640,000,000,000,010

  • The first 1 * 10 is the cost of reading one object with 1 field
  • 200^6 is the cost of all of the nesting levels, simplified
  • The second 1 * 10 is the cost of reading the outermost field.

You can perform this analysis before ever executing the query. Don’t find out the hard way that your server can’t respond to the request. Just do the math and reject the query.
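To make the scoring concrete, here is a minimal sketch of a static cost estimator built on graphql-js's parse function. The point values, the assumed fan-out of 200, and the hard-coded set of list fields are all illustrative; a real implementation would derive list fields and costs from your schema and handle fragments:

const { parse } = require('graphql')

const OBJECT_COST = 10 // hypothetical: 10 points per object
const ASSUMED_LIST_SIZE = 200 // hypothetical fan-out for list fields
const LIST_FIELDS = new Set(['friends']) // a real version reads the schema

function fieldCost(field) {
  if (!field.selectionSet) return 1 // leaf field: 1 point
  // An object costs OBJECT_COST, plus everything selected beneath it
  let cost = OBJECT_COST
  for (const child of field.selectionSet.selections) {
    cost += fieldCost(child) // fragments are ignored in this sketch
  }
  // A list field repeats that cost once per expected item
  return LIST_FIELDS.has(field.name.value) ? cost * ASSUMED_LIST_SIZE : cost
}

function queryCost(queryText) {
  let total = 0
  for (const def of parse(queryText).definitions) {
    for (const field of def.selectionSet.selections) {
      total += fieldCost(field)
    }
  }
  return total
}

// Do the math before executing, and reject queries that are too expensive
const MAX_COST = 10000
function assertAffordable(queryText) {
  if (queryCost(queryText) > MAX_COST) {
    throw Error('Query is too expensive to execute')
  }
}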

Many GraphQL frameworks ship checks for this. They also typically let you restrict the absolute depth a query can reach. This is largely redundant with the complexity test, but if all of your queries are shallow, it can't hurt to enable.

Your framework should also handle concerns that are specific to its own features. For example, some GraphQL frameworks allow multiple queries to be batched and sent together; they should let you limit how many queries can be batched into one request. Smarter frameworks let you configure a maximum total complexity across the whole request.
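If your server accepts batches as a JSON array (the exact shape varies by framework), the limit can be a simple middleware. This sketch assumes a hypothetical Express application with a JSON body parser installed:

const MAX_BATCH_SIZE = 10 // hypothetical limit

app.use('/graphql', (req, res, next) => {
  // Reject oversized batches before any query executes
  if (Array.isArray(req.body) && req.body.length > MAX_BATCH_SIZE) {
    return res.status(400).json({ errors: [{ message: 'Batch too large' }] })
  }
  next()
})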

For a best-in-class experience, look no further than GitHub. They have an advanced per-user rate limiting framework that is implemented using a complexity budget. This means that over a given period of time, each user has a specific amount of complexity that they can use. Each query is executed against this budget. Once you have exhausted the budget, you need to wait for the budget to reset.
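GitHub hasn't published their implementation, but the shape of a complexity budget is simple enough to sketch. The budget size, window, and in-memory store below are all assumptions; a production system would use a shared store like Redis:

const BUDGET = 1000000 // hypothetical points per hour
const WINDOW_MS = 60 * 60 * 1000
const budgets = new Map() // userId -> { remaining, resetAt }

function spendBudget(userId, cost) {
  const now = Date.now()
  let entry = budgets.get(userId)
  if (!entry || now >= entry.resetAt) {
    entry = { remaining: BUDGET, resetAt: now + WINDOW_MS }
    budgets.set(userId, entry)
  }
  if (cost > entry.remaining) {
    throw Error('Rate limited: budget resets at ' + new Date(entry.resetAt))
  }
  entry.remaining -= cost
}

// Charge each query's complexity against the caller's budget before
// executing it: spendBudget(context.user.user_id, queryCost(queryText))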


Problem 3: Scraping attacks

Scraping attacks are similar in spirit to denial of service attacks, but the motivations are a little different. In this case, attackers want your servers to stay up, but they don’t care if the service is a little degraded. They’re trying to maximize how much data they can retrieve from you, so they’re going to push until your server starts to show signs of strain. This will cause problems for your other users, but the attackers are happy.

Let’s make this more concrete. Let’s say that a new hedge fund, SocialPath Inc, is trying to predict the stock price of a company before the company releases their quarterly earnings. Normally, what happens is that companies put out all of their numbers at once: profit, revenue, etc. And then everyone gets the data, and reads it together, and then the stock price moves as people adjust their estimations of future cash flows of the company.

The SocialPath hedge fund has a theory, though: that your revenue is highly correlated with the number of items in your GraphQL API that are marked "active" but have zero inventory. Counting those items would help them predict your revenue.

So they find a way to request all of it at once:

query {
  item_1: item(id: 1) {
    price is_active is_sold_out
  }
  item_2: item(id: 2) {
    price is_active is_sold_out
  }
  item_3: item(id: 3) {
    price is_active is_sold_out
  }
  # ...
  item_100000: item(id: 100000) {
    price is_active is_sold_out
  }
}

And using this approach, they can read the data for 100,000 products in a single request. After just a request or two, they have everything they need to project your revenue.

Attackers can mount this kind of attack along any dimension that your GraphQL queries expose. If they can build a single deep query that scrapes your data, they will. If they can issue a broad and shallow query, like the one pictured here, they will. If they can batch a million separate requests together, they will.

How to protect against scraping attacks

This section is simple: check out the solutions for "Problem 2". They will also protect against this kind of attack. To recap, you want to make sure that you have good general server hygiene and terminate requests that go on too long, reject payloads that are too large, etc. On top of that, you should implement query complexity analysis and reject queries that request too many objects. Finally, if you want to go the extra mile, you can enforce a per-user complexity budget like GitHub's, so a scraper exhausts their quota long before they finish mirroring your data.

I think it’s important to call these out as separate problems, even though they can be mitigated using similar measures.

Problem 4: Evolving your schema

So far, we’ve been focused heavily on attackers. But now we need to focus on the most dangerous attacker: ourselves. GraphQL is a flexible language. We must ensure that we can safely change a schema without breaking existing clients that we care about.

There might be existing clients that we don’t care about: if we allow people to execute queries by hand, we don’t really care if these break. The users can just hand-modify the part that broke.

But if our mobile app from 5 versions ago is still executing a query, and that version still accounts for 8% of our traffic (because users haven't upgraded to the most recent version of iOS yet), then that query had better work. An 8% loss of traffic can be catastrophic for a business, especially if that segment drives a disproportionately large share of your revenue.

Okay. So let’s just not break existing queries. Why is this so tricky? Because your codebase might not have the queries anymore — they may have changed in the subsequent 5 versions. There’s nothing in your current codebase that is verifying that these users are still using these fields.

How to safely evolve your schema

There are a few approaches that you can take.

First, you can take a policy stance that you never remove fields from your schema, and never make them stricter in a way that might break clients. For example, changing a parameter from nullable to non-nullable would break clients that had been passing null and expecting it to work.

This can be a little impractical. Eventually, you'll find a field whose maintenance burden is higher than you would prefer, and you won't want to be barred from removing it. After all, what if nobody is even calling that field?

That’s where the next solution comes in. You can define all of your queries statically, so that clients refer to queries by hash rather than sending the query text itself. The way this works is that your codebase keeps a central registry of all of the GraphQL queries that can be executed. You never delete from this registry, and you treat its entries as immutable. In exchange, you allow yourself to remove fields or parameters that are not referenced by any of these queries.
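Here is a minimal sketch of such a registry (this pattern is often called "persisted queries"); the helper names are hypothetical, and real setups typically generate the hashes at client build time:

const crypto = require('crypto')

const PERSISTED_QUERIES = new Map() // append-only: hash -> query text

function registerQuery(queryText) {
  const hash = crypto.createHash('sha256').update(queryText).digest('hex')
  PERSISTED_QUERIES.set(hash, queryText)
  return hash // clients send this instead of the query text
}

function lookupQuery(hash) {
  const queryText = PERSISTED_QUERIES.get(hash)
  if (!queryText) throw Error('Unknown persisted query hash')
  return queryText
}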

Okay, so we’ve given ourselves a way to remove queries that are completely unused. But can we do better?

Sure enough, we can. We can examine the queries that are actually sent to our service and see whether a field is used. This can be difficult, but if you want to go the free path, you could put together a pipeline that scrapes queries from your logs and determines whether each field is still in use.
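As a sketch of that pipeline's core, graphql-js can parse each logged query and tally the fields it touches. This naive version keys on field name alone; a real tool would resolve each field against its parent type in the schema:

const { parse, visit } = require('graphql')

function countFieldUsage(loggedQueries) {
  const counts = new Map() // field name -> number of appearances
  for (const queryText of loggedQueries) {
    visit(parse(queryText), {
      Field(node) {
        const name = node.name.value
        counts.set(name, (counts.get(name) || 0) + 1)
      },
    })
  }
  // Schema fields that never appear in the tally are removal candidates
  return counts
}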

There are also services that give you this behavior. Apollo GraphQL has a tool that compares the queries that have executed against your schema with the new version of the schema that you'd like to push. ApiHost, the site that this post is written on, is also building a tool that lets you add this kind of analysis to your CI pipeline.


Problem 5: Caching

Now that we’ve mostly secured our service, we want to protect it against good news: the “hug of death.” When online services get really popular, their servers can get overwhelmed because too many clients are all executing the same requests. Some GraphQL clients have client-side caching, but this doesn’t help if thousands of new clients all need to warm their cache up. If we have a large influx of traffic from lots of clients, we should ensure that the server has the option of caching the data if it’s not too large.

How to effectively cache GraphQL queries from the server

At the technology level, you have a few options: use HTTP Cache-Control headers if your clients respect them; put a cache like Varnish between your server and your users (it will respect Cache-Control headers for you); or modify your server itself to inspect the query and execute caching logic, perhaps with the help of a cache like Memcache or Redis.
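As a sketch of the first option, a server that accepts queries over GET can mark public responses as cacheable so a CDN or Varnish can serve repeats; app and executeQuery are hypothetical stand-ins:

app.get('/graphql', async (req, res) => {
  const result = await executeQuery(req.query.query) // hypothetical executor
  // Only for public data; per-user responses need a varying cache key
  res.set('Cache-Control', 'public, max-age=60')
  res.json(result)
})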

You should also consider whether your queries return per-user data. If so, either stick to client-side caching for those queries (since the responses are per-user anyway), or make sure that the cache "varies" by user authorization. For example, if a header on the HTTP request contains the bearer token, make that part of the cache key to guarantee that per-user data does not leak between requests.
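Here is a minimal sketch of a server-side cache keyed by both the query and the caller's bearer token, so per-user data never leaks across users; cacheGet and cacheSet stand in for Memcache or Redis calls:

const crypto = require('crypto')

function cacheKey(req, queryText, variables) {
  const auth = req.headers['authorization'] || 'anonymous'
  return crypto
    .createHash('sha256')
    .update(auth) // vary by user authorization
    .update(queryText) // and by the query itself
    .update(JSON.stringify(variables || {}))
    .digest('hex')
}

async function cachedExecute(req, queryText, variables, execute) {
  const key = cacheKey(req, queryText, variables)
  const hit = await cacheGet(key) // hypothetical cache read
  if (hit) return hit
  const result = await execute(queryText, variables)
  await cacheSet(key, result, 60) // hypothetical write with a 60s TTL
  return result
}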

Some services like Hasura offer response caching by adding directives to the query.


Conclusion

In order, the five problems that your GraphQL server needs to solve before it can go live in production are:

  • Protecting against inconsistent authentication that might leak sensitive data
  • Stopping Denial of Service attacks that cause your service to experience downtime
  • Protecting against scraping attacks designed to create a complete unauthorized mirror of your site’s data
  • Ensuring that you can safely change the schema without taking down an existing client
  • Caching popular queries so that your service can safely scale

Liked what you read? I’m building a service to solve these problems. If you want to check it out (or subscribe for more guides like this one), then head to https://www.apihost.dev and enter your email address.


Jacob Voytko

Runnin’ my own business. Previously staff engineer @ Etsy, before that I worked on Google Docs