by Denys Kurylenko, senior staff engineer
Every piece of software you use in isolation may have perfect code. But when those softwares have to work together, situations can appear that create a systematic bug – forcing you to create workarounds to help them “play nicely.”
Following is an example of a situation we encountered where we had to do just that: create a way to help the software “speak the same language.” Let’s explore what happened, and how we handled it.
Last year, we experienced a spike of AWS 400/InvalidSignatureException errors when calling DynamoDB; we dug through the code to figure out why. The AWS Golang SDK signs HTTP requests (headers and payload) before sending them to AWS servers. If your outbound middleware adds any headers – such as trace IDs – the AWS server will ignore them (since they are unsigned). However, during a retry, AWS SDK will sign the previous request’s headers. So, if your middleware modifies any headers it previously set, the AWS server will now reject the request, since the signed headers no longer match.
These errors were especially rare under normal circumstances, but would show up consistently during traffic spikes due to load testing or business events*, leading to potential issues such as rate limiting, outages, and other errors – which can affect public perception and customer trust.
*Examples of business events are large promotions with partners and merchants, marketing campaigns that go viral (like Fast’s Parker Goes Fast campaign), or holiday events like Black Friday/Cyber Monday, all of which can require sudden, additional bandwidth.
After AWS rejects a set threshold* number of requests, services may be pulled into the AWS Tarpit, which further degrades the success rate of subsequent valid, legitimate requests.
While a more sustainable approach might be to patch the AWS SDK to sign headers during retry only if it previously signed those headers on prior requests, we needed a stop-gap solution.
We achieved this by stopping the addition of extra headers to outbound AWS servers if those headers could change from request to request.
*AWS DynamoDB thresholds are based on the provisioning inputs. Provisioning can be dynamic: i.e., AWS auto-scales up, based on traffic. It takes 5-10 minutes for DynamoDB to scale up based on traffic, and during that time, we might see throttling errors.
Or, it can be static: we over-provision (and overpay), even if we don't use all of the set capacity. One example of over-provisioning is Black Friday: we predicted increased holiday traffic to ensure we would not experience outages.
Under heavy load, we noticed a significant increase in InvalidSignatureException errors returned from AWS APIs such as DynamoDB. “Heavy load” in this case refers to an increase in requests per second relative to our baseline, not an overload of DynamoDB itself. This results in throttling requests and forcing caller retry.
Our theories on why
Server/pod system time mismatch
- If our server clocks had diverged from DynamoDB’s clocks, then auth signatures (which include a validity period) could appear invalid.
- This ended up being incorrect, but brought us to all Nitro instances, which improved performance and improved clock source with kvm-clock.
Token being used right up until expiry time
- Sending a token just before it expires could result in the token being expired by the time DynamoDB validates it.
- We ended up extending our AWS Security token validity time out of an abundance of caution.
AWS credentials expiring more frequently
- This turned out to be an incorrect assumption. We tried to increase the expiration window to 24 hours, but saw errors occurring at the same pace.
Interference caused by Istio
- It’s possible that Istio (a service mesh that manages egress traffic to AWS services) could have been taking some action during increased traffic periods that caused DynamoDB requests to fail.
- We didn’t explore this deeply.
Too many requests from a single IP address in a short period of time
- This issue was reproducible from different machines using both shared AWS credentials and machine-specific ones.
Bug in DynamoDB HTTP client-connection reuse
- Connection reuse in library code could cause bad tokens to be reused.
- Forcing DynamoDB to create a new HTTP connection with every single request didn’t resolve the issue.
Why did it happen?
You might not be right the first time – you have to be willing to explore seemingly impossible possibilities.
We started with our call stack:
service_code -> aws_sdk -> http_client -> istio -> dynamo db
We initially thought there could be a couple of suspects (namely aws_sdk -> http_client andhttp_client -> istio) interfering with requests.
Testing the theory
Looking at two requests in the screenshots below, one request includes x-datadog-* in the Authorization request – but not the other. The one that doesn’t include it shows 200 OK, but the one that does shows a 400 BadRequest.
So, it looks like the headers are the issue. But how did the headers get injected into the request?
The http_client client injects them.
But that begs another question: why are those headers signed in one request, yet not in another?
This leads us to aws-sdk Request.Send():
This is a pretty standard while-true loop that signs/sends/retries requests. If we go inside sign func, it leads us to AWS SignerV4, which blindly signs all HTTP headers with the exception of User-Agent/X-Amzn-Trace-Id/Authorization, and r.SendRequest() will eventually call our http_client.
That means after the first iteration of this loop, our request will collect numerous, different headers, and all of them will be signed – except those identified as x-datadog-*. If the request returns 200 OK (99.9% of our requests), we’re good to go.
If the request needs to be retried, the code flow will go into the second iteration. That second iteration will call sign() func again, and sign any x-datadog-* headers – again, so far, so good.
However, once we enter http_client for the second time, a new Opentracing span will be created, meaning span_id will change. OpenTracing HTTPCarrier will override the x-datadog-parent-id with a new span_id, rendering the signature stored in the "Authorization" header invalid.
Why do we see this only under high load?
Looking at the logs information that precedes the InvalidSignatureException, we see:
This tells us that when under high load, DynamoDB will try to scale up – but at the same time, throttle our requests – because it will consider the request rate to be too high.
400/com.amazonaws.dynamodb.v20120810#ProvisionedThroughputExceededException is a retryable exception, and DynamoDB DefaultRetryer will attempt to retry it, ultimately leading us into the second iteration of the while-true loop described above.
The following flowchart explains the life cycle of the request:
What went wrong – and can we identify these kinds of errors sooner or more easily?
Both pieces of code in isolation are perfectly valid, but their combination creates a perfect storm:
The AWS Golang SDK signs HTTP requests (headers and payload) before sending them to AWS servers. If your outbound middleware adds any headers, such as trace/span IDs, the AWS server will ignore them, since they are unsigned. However, during a retry, the AWS SDK will sign the previous request’s headers. So, if your middleware modifies any headers it previously sent, the AWS server will now reject the request, since the signed headers no longer match the headers that were provided.
Can we identify the situation sooner? Unfortunately, no. There are multiple layers of abstraction:
- Our code sits on top of the AWS GO Library,
- The AWS GO Library sits on top of the AWS HTTP REST API, and
- AWS HTTP REST API sits on top of the AWS firewall/throttler
And every layer makes various assumptions that just don't play nicely when coupled.
The only practical solution, then, is to open the debugger and step through the code – multiple times – until the picture is clear in someone's mind.
How we fixed the problem
We added an AWS session SendHandler that persists signed headers into context.Context. Subsequent retries read previously signed headers back, and ensure that the value of those headers never change, even if the header itself does change (i.e., Opentracing span_id changed).
In this article, Aspen Mesh describes an error use case similar to what we experienced: https://aspenmesh.io/distributed-tracing-with-istio-in-aws/
Not every fix renders the situation solved once and for all. Sometimes, you have to create stop gaps. And in other situations, the actual solution creates new bumps that you simply have to address. By making room for solutions to not be a catch-all, you empower your team to get creative with their approaches, and provide psychological safety to try various methods to get to the desired outcome – until one day, a more worthy cure is found.
Denys Kurylenko works at Fast on the Reliability and Performance Team, focused on technical infrastructure for their one-click online checkout platform. Denys' extensive expertise runs across multiple tech stacks, with an emphasis on creating modular, scalable software with no shortcuts to the quality of code. Denys holds a Master of Science in Computer Science and resides in the Seattle area.