A Guide to GraphQL Schema Federation, Part 3

Caching Query Plans

Welcome to the third part in a series of posts on GraphQL Schema Federation. In this post we will explore a performance optimization for our gateway that prevents the re-calculation of query plans for repetitive queries.

While caching query plans does improve your API's performance, we can also use this cache as a list of approved queries, preventing users from getting creative with our API in unapproved ways.

Measuring the Planning Phase

As with any performance optimization, we are going to begin by measuring the current behavior. Before we add a cache to the services we've been building in this series, let's take a quick detour and explore some of the performance characteristics of our gateway. For now, just clone this repository somewhere on your machine with the following command:

git clone https://github.com/alecaivazis/query-plan-benchmark && cd query-plan-benchmark

If you look inside, you will notice that there is a single file called benchmark.go where we define separate schemas that represent different services in a toy environment.

In order to keep things relatively simple, this benchmark sets the Executor used by the gateway to always return a fixed value from memory. This avoids any network requests or consideration of the contents of the plan, so we can concern ourselves only with the generation of the plan itself.

Apart from being separated across multiple fragments, the query that we are going to use in our tests looks relatively harmless:

query {
    allUsers {
        ...UserList
    }
}

fragment UserList on User {
    firstName
    market {
        ...MarketDetails
        owner {
            market {
                owner {
                    market {
                        owner {
                            market {
                                owner {
                                    market {
                                        owner {
                                            market {
                                                owner {
                                                    market {
                                                        owner {
                                                            firstName
                                                        }
                                                    }
                                                }
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

fragment MarketDetails on Market {
    name
    logo {
        thumbnailURL
    }
}

If you look at how this information is distributed across our services, you'll notice that this query requires at least three separate steps: one for allUsers, one for the market of each User, and a final one for the details of each Market. On top of that, the final step of market info is very deep. And without going into too much detail, the addition of fragments into this query further complicates the planning logic, since the Planner must ensure that services do not get versions of fragments with fields they do not recognize.

To see how long planning this query takes, run the benchmark:

go run benchmark.go

On my desktop at home, it takes around 18ms.
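If you would rather time a planner call in your own code, the measurement itself is small. This is a minimal sketch, not the contents of benchmark.go: averageDuration and planQuery are hypothetical names, and planQuery only sleeps as a placeholder for the real planning work.

```go
package main

import (
	"fmt"
	"time"
)

// averageDuration runs fn n times and returns the mean wall-clock time per run.
func averageDuration(n int, fn func()) time.Duration {
	if n <= 0 {
		return 0
	}
	start := time.Now()
	for i := 0; i < n; i++ {
		fn()
	}
	return time.Since(start) / time.Duration(n)
}

// planQuery is a hypothetical placeholder for the planner call being measured.
func planQuery() {
	time.Sleep(time.Millisecond)
}

func main() {
	fmt.Println("average planning time:", averageDuration(100, planQuery))
}
```

Averaging over many runs smooths out one-off noise, which matters when the thing being measured only takes a few milliseconds.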

Admittedly, this is very small compared to the overall response time of the API. In fact, if this were the upper limit on query complexity for your API, then you probably don't need to be caching query plans at all. Rather than showing you a contrived query that takes a while and using it to convince you that you need to be caching your plans, I suggest that you take a moment now to grab one of the larger queries from a project you are working on and plug it into this benchmark. If you cannot get the script to run in an uncomfortable amount of time on a machine that resembles your gateway's production environment, you probably don't need to worry about caching your query plans. If you are still interested in how we can reduce the impact of the planning phase, please keep reading.

How to Cache Query Plans

Now that we have decided we do in fact want to cache our query plans, we need to talk about what the various options are. Query plan caches broadly fit into two categories that I refer to as "static" and "automatic". As usual, each comes with its own benefits and trade-offs.

Static Query Plan Cache

Static Query Plan Caches are the simpler of the two and involve the pre-computation of all possible queries that a client can perform. Once we have generated this list of queries, we can create some string that identifies the query and pass the pair to the API when it starts up. The client can then use this string as a reference to the actual query instead of sending the full text. Doing this not only saves us from sending a body of text over the wire but it also prevents the server from doing the same parsing and planning logic over and over.

An important implementation detail of a server that implements a static query plan cache is how it handles a request that is not part of this initial list. While it might seem to go against the flexibility of GraphQL to reject queries that the server did not initially know about, this added layer of security can be extremely useful for APIs that are not necessarily designed for public consumption. If this extra rigidity is not appropriate for your API, you might want to try an automatic query plan cache, which was designed to allow for the same low-bandwidth communication without the headache of a pre-defined query list.
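A static cache that rejects unknown queries can be sketched in a few lines. Everything here is an assumption made for illustration: Plan, StaticCache, and ErrUnknownQuery are hypothetical names and are not part of nautilus.

```go
package main

import (
	"errors"
	"fmt"
)

// Plan is a hypothetical stand-in for a pre-computed query plan.
type Plan struct {
	Steps []string
}

// ErrUnknownQuery is returned for any ID that was not registered at startup.
var ErrUnknownQuery = errors.New("query is not in the approved list")

// StaticCache maps query IDs to plans and is never written to after construction.
type StaticCache struct {
	plans map[string]Plan
}

// NewStaticCache copies the approved plans so that later changes to the
// caller's map cannot alter the approved list.
func NewStaticCache(approved map[string]Plan) *StaticCache {
	plans := make(map[string]Plan, len(approved))
	for id, plan := range approved {
		plans[id] = plan
	}
	return &StaticCache{plans: plans}
}

// Retrieve returns the plan for id, or ErrUnknownQuery if the ID was never approved.
func (c *StaticCache) Retrieve(id string) (Plan, error) {
	plan, ok := c.plans[id]
	if !ok {
		return Plan{}, ErrUnknownQuery
	}
	return plan, nil
}

func main() {
	cache := NewStaticCache(map[string]Plan{
		"AllUsers": {Steps: []string{"users: allUsers"}},
	})
	_, err := cache.Retrieve("SomethingElse")
	fmt.Println(err)
}
```

Since the map is only read after startup, concurrent requests can share it without a lock. That is part of what makes the static variant the simpler of the two.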

Automatic Query Plan Cache

While the core interaction of an automatic query plan cache still depends on the exchange of a short string as a reference to a pre-generated query plan, if the server encounters a query it has not seen before, it will cache the plan generated and provide the user a hash it can use next time instead of sending the full query body.

This approach not only retains the flexibility of our GraphQL API by caching new query plans as they are generated, but it also removes a rather complicated build step that keeps the client's queries in sync with the approved list on the server. Unfortunately, this does come at a cost. Since the client is free to send any queries they want, it is no longer possible to have a list of "approved" queries. On top of that, the list of cached queries is unbounded. We have to make sure that some kind of garbage collection is in place that cleans up old queries that may not be used again.

Enabling a Query Plan Cache

Now that we have gotten the theory out of the way, let's go ahead and see how we would add a query plan cache to the gateway we've been building throughout the series. Go ahead and clone the starting code which should be where we left off in part 2:

git clone https://github.com/alecaivazis/schema-federation-demo && cd schema-federation-demo && git checkout query-plan-cache

In the nautilus gateway, query plan caches are driven by the QueryPlanCache interface:

type QueryPlanCache interface {
	Retrieve(ctx *PlanningContext, hash string, planner QueryPlanner) ([]*QueryPlan, error)
}

Out of the box, you can opt into an automatic query plan cache by adding this single line in the call to gateway.New around line 70 of gateway.go:

gateway.WithAutomaticQueryPlanCache()

Once you have added that, start the gateway by following the instructions in the README.

Working With the Cache

With everything up and running, you should be able to execute the following command in your terminal and see a list of every auction in our platform:

curl -d '{"query":"{ allAuctions { id } }" }' -H "Content-Type: application/json" -X POST http://localhost:3000/graphql

Usually, a GraphQL response includes only two fields: "data" and "errors". However, if you look closely at the response, you'll notice a third field called "extensions". Don't worry, this is not a special field that nautilus has made up. In fact, this field is explicitly defined in the spec as an object that is free for services to use for their own purposes. Nautilus takes advantage of this and uses this field to share caching information with the client.

If you look at the response, you'll see that there is a "persistedQuery" key inside of the "extensions" object. The "sha256Hash" key within that object holds a string that can be used instead of the query we just performed.

If you now take the value of the "sha256Hash" key in the response and paste it into this command, you should get the same response as before:

curl -d '{"extensions": {"persistedQuery": {"version": 1, "sha256Hash": "<hash>"}}}' -H "Content-Type: application/json" -X POST http://localhost:3000/graphql

Notice that we did not provide an actual query body. The hash that you pasted in the command was enough to retrieve the query plan, saving us the bandwidth as well as the cost of parsing the query again. Keep in mind that the AutomaticQueryPlanCache will look the query up in the cache regardless of whether you passed in the hash explicitly or not. You only need to pass in the hash if you care to save bandwidth on each request.
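A client written in Go could build the same hash-only body like this. This is a minimal sketch that assumes the gateway accepts "extensions" as a JSON object, as in Apollo's Automatic Persisted Queries protocol. The struct and function names are hypothetical; only the JSON keys come from the response we looked at above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// persistedQuery mirrors the "persistedQuery" object found in the response.
type persistedQuery struct {
	Version    int    `json:"version"`
	Sha256Hash string `json:"sha256Hash"`
}

type extensions struct {
	PersistedQuery persistedQuery `json:"persistedQuery"`
}

// requestBody leaves out "query" entirely when it is empty.
type requestBody struct {
	Query      string     `json:"query,omitempty"`
	Extensions extensions `json:"extensions"`
}

// hashOnlyBody builds a request body that carries the hash and no query text.
func hashOnlyBody(hash string) ([]byte, error) {
	return json.Marshal(requestBody{
		Extensions: extensions{
			PersistedQuery: persistedQuery{Version: 1, Sha256Hash: hash},
		},
	})
}

func main() {
	// Replace the placeholder with the sha256Hash value from an earlier response.
	body, err := hashOnlyBody("hash-from-previous-response")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The resulting bytes can be sent as the body of the same POST request we made with curl.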

If the names seem out of place in the context of a federated gateway, I'm sorry. The keys and structure of this interaction are designed to integrate seamlessly with Apollo's support for Automatic Persisted Queries, which does not know it's being used for a query plan cache.

Conclusion

In this blog post we started off with a small exercise that should help you make an informed decision about whether you truly need to cache your query plans. We then explored the difference between an "automatic" and a "static" query plan cache. We finished up by adding an automatic query plan cache to our services and saw how we can extract the query plan cache key from the response to dramatically reduce our request body size.

That's it for now! Thank you so much for getting this far with me. At the moment, the next post in this series is not ready yet. I do know that it will either showcase a way to force your gateway to auto-update as your services pop into existence, or a follow-up to part 2 that shows how to use custom directives to secure your API. Stay tuned!