What are timeouts, you ask?
Timeouts are settings that a developer configures for calls from one service to another. They limit how long the calling service waits before giving up.
Connection Timeout
Say,
Service A is calling service B.
Service B has a total of 10 connections and assume all of them are used.
Now, if service A initiates a connection to service B, service B has no connection to offer to service A.
One practical example is when you call someone over the phone and it keeps ringing. If no one picks up within the ring time of, say, 30 seconds, the call disconnects.
What you would likely do is try again after some time, hoping someone would pick up the call.
In such a situation, connection timeout is the duration for which service A will keep trying to make the connection to service B.
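To make this concrete, here is a minimal sketch of a connection timeout using Python's standard socket module. The function name can_connect and the parameter connect_timeout_s are made up for illustration. Real services usually set this value through their HTTP client or connection pool configuration rather than on raw sockets.

```python
import socket


def can_connect(host: str, port: int, connect_timeout_s: float) -> bool:
    """Return True if a TCP connection is established within the timeout."""
    try:
        # create_connection gives up once connect_timeout_s seconds have passed.
        with socket.create_connection((host, port), timeout=connect_timeout_s):
            return True
    except OSError:
        # Covers a timed-out attempt as well as a refused connection.
        return False
```

For example, service A could call can_connect("service-b.internal", 8080, 0.3) to wait at most 300 ms for service B, where the host name and port are placeholders.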
What happens if the connection timeout is too high or not present?
Service A will keep piling up requests to service B, which makes it harder for service B to recover from an outage. It might even lead to a cascading failure in service A.
What to do if the logs show a connection timeout?
If you see a lot of connection timeouts in logs, you should be concerned.
As an owner of service A, you should check if you are making more requests to service B than what has been agreed with the owner of service B.
As an owner of service B, you may increase the number of pods as an immediate fix.
What is the right value of a connection timeout?
A value in the hundreds of milliseconds is usually sufficient.
Read Timeout
Say,
Service A is calling service B.
The connection is established.
Read timeout is the time for which service A will wait for service B to return data after the connection is made.
A practical example is an old-style pay phone where we had to insert a coin to get 60 seconds of talk time. We drop a coin in the phone box and dial a number. Someone picks up, and then the call automatically disconnects after, say, 60 seconds, even if the conversation is not over. This is an example of a read timeout.
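The following is a small sketch of a read timeout in Python, under a few assumptions: service B is simulated by a local one-shot server that replies after a delay, and the names start_slow_server and call_with_read_timeout are hypothetical helpers, not part of any library. The connection succeeds right away, and the read timeout only limits how long the caller waits for data afterwards.

```python
import socket
import threading
import time
from typing import Optional


def start_slow_server(delay_s: float) -> int:
    """Start a one-shot local server that replies after delay_s seconds; return its port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def serve() -> None:
        conn, _ = srv.accept()
        with conn:
            time.sleep(delay_s)  # simulate slow processing in service B
            try:
                conn.sendall(b"pong")
            except OSError:
                pass  # the caller already gave up and closed the connection
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return port


def call_with_read_timeout(port: int, read_timeout_s: float) -> Optional[bytes]:
    """Connect, then wait at most read_timeout_s for a response."""
    with socket.create_connection(("127.0.0.1", port), timeout=1.0) as sock:
        sock.settimeout(read_timeout_s)  # from here on, this bounds each recv()
        try:
            return sock.recv(1024)
        except socket.timeout:
            return None  # service B did not answer in time
```

With a server that takes 0.5 seconds to reply, a read timeout of 0.1 seconds makes the call return None, while a read timeout of 2 seconds lets the response through.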
What happens if the read timeout is too high or not present?
Service A will keep piling up requests to service B, which makes it harder for service B to recover from an outage. It might even lead to a cascading failure in service A.
What happens if the read timeout is too small?
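Service A will give up before service B has had time to respond. Service A will keep calling service B, and every call will fail without a response even though service B is working normally.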
What to do if the logs show a read timeout?
If you see a lot of read timeouts in logs, that means that the response time from service B has increased.
As an owner of service A, you should find out if you need to increase the value of the read timeout.
As an owner of service B, you should check the 99th percentile of the API's response time, particularly if there has been a recent code change. You should also check whether the API makes an external call and whether there is any issue with those external services.
What is the right value of a read timeout?
This depends on the API. The owner of service B should run a load test on the API and report the 99th percentile response time to the owner of service A. Service A should set its read timeout to the 99th percentile plus an additional buffer.
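As a rough sketch of that calculation, the snippet below derives a read timeout from load-test latencies in Python. It assumes a nearest-rank percentile and a fixed buffer in milliseconds. Both choices, along with the function names and the 50 ms default, are illustrative assumptions rather than a prescribed method.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggested_read_timeout_ms(samples_ms: Sequence[float], buffer_ms: float = 50) -> float:
    """99th percentile latency plus a fixed buffer (the buffer size is an assumption)."""
    return percentile(samples_ms, 99) + buffer_ms
```

For instance, if a load test recorded 100 response times of 1 ms, 2 ms, and so on up to 100 ms, the 99th percentile is 99 ms, and a 50 ms buffer gives a read timeout of 149 ms.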