Published on 2024-01-10 by GCH

Reliable invocation of HTTP APIs

As the new year sets in and winter once again proves to be cold, this blog is forced to give up on the silly season and return to its original greyness, to discuss the pains of unreliable APIs. By unreliable APIs we mean, of course, all APIs, but especially APIs exposed by cloud services, which are shared across many customers.

While it is true that no API is 100% reliable, it is also true that remote APIs maintained by 3rd parties are exposed to the savagery of the Internet in general, to accidental overuse by legitimate customers, and to every network problem we've ever heard of, including high latency and packet loss.

This can result in responses other than the desired one being sent to the API client, or it can cause the client to hang until it times out, situations which lead to the failure of any process that depends on the successful invocation of the remote API.

We use 3rd party systems to save us the effort of building them. But we do so at the cost of dealing with unreliability. This cost grows with the frequency of invocation as well as with the impact of each failure. For instance, if a 2-hour-long automated process depends for a specific task on input from an external API and that process is executed 30 times a month, we are exposed to a loss of 60 * X hours a month, where X is the rate of API failures. A 5% API failure rate would cause an average direct loss of 3 hours of work, plus the human hours spent on relaunching the process. Should the same process depend upon two APIs instead of one, both with the same failure rate, the expected work loss would be roughly twice as high.

Fortunately, transient remote API errors can be mitigated using a simple retry + exponential backoff pattern, which the python requests module supports through urllib3's Retry class and the HTTPAdapter. We would simply start with a few definitions,
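along the lines of the sketch below (the concrete values are illustrative choices that match the worked example further down, and the endpoint URL is just a placeholder):

import requests
from requests.adapters import HTTPAdapter
from requests.exceptions import RequestException
from urllib3.util.retry import Retry

# Illustrative parameter values - adjust them to the service at hand
nr_retries = 5                     # total number of retries (N)
backoff_factor = 1                 # backoff factor in seconds (b)
client_timeout = 3                 # connect/read timeout in seconds (T/2)
FORCE_RETRY_STATUS_CODE_LIST = [429, 500, 502, 503, 504]

# Placeholder endpoint - any remote API URL would go here
url = "https://httpstat.us/200"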

We would then use these definitions in code similar to what we show below:

# Create a session whose HTTP and HTTPS adapters retry transient failures
# with exponential backoff
requester = requests.session()

retries = Retry(total=nr_retries, backoff_factor=backoff_factor,
                status_forcelist=FORCE_RETRY_STATUS_CODE_LIST)

requester.mount("https://", HTTPAdapter(max_retries=retries))
requester.mount("http://",  HTTPAdapter(max_retries=retries))

try:
    response = requester.get(url=url, timeout=client_timeout)
    # Turn HTTP error statuses that were not retried into exceptions as well
    response.raise_for_status()
    print("Request done")

except RequestException as err:
    # RequestException is the base class covering exhausted retries,
    # timeouts and HTTP error statuses alike
    print("There was a problem in the request to service: " + str(err))

The waiting time between retry i and retry i+1, with b denoting the backoff factor, is given by:

\( t_i = b 2^{i-1} \)

for i >= 1, i.e., for retries after the first one. According to the official documentation there is no waiting time between the initial invocation and the first retry. This means that for N retries, when N>=2, the total waiting time will be given by a partial sum of the geometric series:

\( t = \sum\limits_{i = 1}^{N-1} b 2^{i-1} = b \sum\limits_{i = 0}^{N-2} 2^{i} = b{ {1-2^{N-1}} \over {1-2} } = b(2^{N-1}-1) \)
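As a quick sanity check of the closed form, a few lines of python can compare it against the explicit partial sum (the function names here are ours, purely for illustration):

def total_backoff_wait(nr_retries, backoff_factor):
    # Explicit partial sum of t_i = b * 2**(i-1) for i = 1 .. N-1
    return sum(backoff_factor * 2 ** (i - 1) for i in range(1, nr_retries))

def total_backoff_wait_closed_form(nr_retries, backoff_factor):
    # Closed form b * (2**(N-1) - 1) of the same sum
    return backoff_factor * (2 ** (nr_retries - 1) - 1)

# With N = 5 and b = 1 both expressions yield 15 seconds of waiting
assert total_backoff_wait(5, 1) == total_backoff_wait_closed_form(5, 1) == 15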

The time estimated above does not include the time necessary for the connection to be established and the time for the HTTP reply to be completed. This additional time may or may not be relevant depending on how the remote system behaves: it could be replying quickly with HTTP error 502, it could be hanging for several seconds during the connection phase or it could be sending the HTTP reply very slowly.

In its simplest form, handling slow remote systems with the requests module is done by passing a client timeout argument, which applies both to the establishment of the connection and to the receipt of the response. If that value is, say, T/2, the worst case processing time for a single attempt will be approximately T (it cannot be exactly T, because if establishing the connection took the full T/2 seconds the attempt would be considered a failure and no data would be transferred; but when establishing the connection takes almost T/2, the worst case for establishing the connection plus transferring the response approaches T).
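Should different limits ever be needed for the two phases, requests also accepts a (connect, read) tuple; a minimal illustration, with both names left for the reader to define:

response = requester.get(url=url, timeout=(connect_timeout, read_timeout))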

Thus, we can estimate the worst case mitigation time to be:

\( t = b(2^{N-1}-1) + NT \)

We need to adjust these parameters as we learn from the specific 3rd party services we use, but in most cases the order of magnitude of the mitigation time is seconds. For instance, a very conservative choice of N=5, T=6, b=1 would lead to a worst case mitigation time of 45s. If this additional time saves executions of critical multi-hour processes, it is certainly worth the wait.
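Plugging the formula above into a couple of lines of python makes it easy to play with parameter choices (the function name is ours, for illustration only):

def worst_case_mitigation_time(nr_retries, client_timeout, backoff_factor):
    # b * (2**(N-1) - 1): total waiting time introduced by the backoff
    total_backoff_time = backoff_factor * (2 ** (nr_retries - 1) - 1)
    # N * T, with T being twice the client timeout
    total_attempt_time = nr_retries * 2 * client_timeout
    return total_backoff_time + total_attempt_time

# N = 5, T = 6 (client timeout of 3 seconds), b = 1  ->  45 seconds
print(worst_case_mitigation_time(nr_retries=5, client_timeout=3, backoff_factor=1))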

Now, practice is often more appealing than theory, so I invite you to have a look at this simple CLI application, which allows for the simulation of remote HTTP failures using the amazing httpstat.us service. This application makes it easier to understand the relationship between the chosen parameters and the total mitigation time. You can simulate the example mentioned above by executing:

python3 remote-API-invocation.py -nr 5 -ct 3 -bf 1

where the -ct parameter corresponds to T/2.
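For reference, a minimal sketch of what such a CLI could look like is shown below; it reuses the retry setup from earlier and points at httpstat.us's 503 endpoint, but it is only an approximation of the idea, not the linked application itself:

import argparse
import requests
from requests.adapters import HTTPAdapter
from requests.exceptions import RequestException
from urllib3.util.retry import Retry

parser = argparse.ArgumentParser(description="Simulate retries against an unreliable endpoint")
parser.add_argument("-nr", type=int, default=5, help="total number of retries (N)")
parser.add_argument("-ct", type=float, default=3, help="client timeout in seconds (T/2)")
parser.add_argument("-bf", type=float, default=1, help="backoff factor in seconds (b)")
args = parser.parse_args()

retries = Retry(total=args.nr, backoff_factor=args.bf,
                status_forcelist=[500, 502, 503, 504])
requester = requests.session()
requester.mount("https://", HTTPAdapter(max_retries=retries))

try:
    # httpstat.us/503 always replies with HTTP 503, so every retry gets used
    response = requester.get("https://httpstat.us/503", timeout=args.ct)
    response.raise_for_status()
    print("Request done")
except RequestException as err:
    print("There was a problem in the request to service: " + str(err))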

Final note: be sure not to apply retries to POST requests unless you are absolutely sure it is safe. POST requests are often used for operations that cause state changes (e.g. creating database entries), and retrying them may cause inconsistencies.