Troubleshooting Router Error Responses

Page last updated:

This topic helps operators to better understand if 502’s are a result of the Swisscom Application Cloud Platform or an application.

Points of Failure

There are different points of failure in which 502’s can come from:

  1. Infrastructure
    • Load Balancer
    • Network
  2. Platform - CF
    • Gorouter
    • Diego Cell(s)
  3. Application

In the Infrastructure, 502’s can occur in the following way:

  • From the Load Balancer, 502’s can surface when the Gorouters are not receiving traffic at all.
    • This can be observed if the Load Balancer is logging 502’s but the Gorouters are not.

In the Platform, 502’s can occur in the following ways:

  • If the Gorouter is unable to connect to the application container:
    • TCP dial issues (can’t make an initial connection to the backend). The Gorouter will retry TCP dial errors up to three times, if it still fails then a 502 will be returned to the client and logged to the access.log. This may be due to:
      • An application that is unresponsive (which indicates an issue with the application)
      • The Gorouter has a stale route (which indicates an issue with the platform)
      • The application container is corrupted (which indicates a problem with the platform)
      • These types of errors may look like this within the gorouter.log:
        [2018-07-05 17:59:10+0000] {"log_level":3,"timestamp":1530813550.92134,"message":
        "backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":
        {"ApplicationId":"","Addr":"10.0.32.15:60099","Tags":null,"RouteServiceUrl":""},
        "error":"dial tcp 10.0.32.15:60099: getsockopt: connection refused"}}
        
  • If the Gorouter successfully dials the endpoint but an error occurs:
    • read: connection reset by peer errors can occur when the application closes the connection abruptly with a TCP RST packet and not the expected FIN-ACK. This will cause the Gorouter to retry the next endpoint. Note, Gorouter does not currently retry on write: connection reset by peer failures.
    • TLS Handshake errors. When these errors occur, the Gorouter will retry up to three times and if it’s still failing then a 502 may be returned. These errors appear similar to the following in the gorouter.log (and a 502 will be logged in the access.log):
      [2018-07-05 18:20:54+0000] {"log_level":3,"timestamp":1530814854.4359834,"message":"
      backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":
      {"ApplicationId":"","Addr":"10.0.16.17:61002","Tags":null,"RouteServiceUrl":""},
      "error":"x509:certificate is valid for 53079ca3-c4fe-4910-78b9-c1a6, not xxx"}}
      
    • If the Gorouter successfully connects to the endpoint, but an error occurs while the request is in transport (i.e. Gorouter has not received a response from the endpoint):
    • Prior to routing-release 0.166.0, there was a bug that logged a 502 for requests canceled by clients before the server responded with headers. routing-release 0.166.0 and beyond, if the same situation occurs, a 499 is returned.

In an Application, 502’s can occur in the following ways (Note: the Gorouter will not retry any error response that is returned by the application):

  • If 502’s are only occuring from a particular application instance(s) and not all of the applications on the platform, then it is likely an application-related error (i.e. application is overloaded, unresponsive, can’t connect to database, etc.).
  • If all applications are experiencing 502’s, then it could either be a platform issue (possible misconfiguration) or an application issue (i.e. all applications are unable to connect to an upstream database).

General Debugging Steps

Here are general debugging steps for any issue resulting with 502 error codes:

  • Gather the Gorouter logs & Diego Cell logs at the time of the incident.
  • Review the logs and consider the following questions:
    1. Which errors are the Gorouters returning?
    2. Is the Gorouter’s routing table accurate (are the endpoints for the route as expected)?
    3. Do the Diego Cell logs have anything interesting about unexpected app crashes and/or restarts?
    4. Is the application healthy and handling requests successfully? (try using request tracing headers to verify)
  • Was there a recent platform change/upgrade that caused an increase in 502’s?
  • Are there any suspicious metrics spiking? How is CPU & Memory utilzation?

Gorouter Error Classification Table

Error Type Status Code Source of Issue Evidence
Dial 502 Application or Platform logs with error dial tcp
ConnectionResetOnRead 502 Application or Platform logs with error read: connection reset by peer
AttemptedTLSWith
NonTLSBackend
525 Platform* - logs with error tls: first record does not look like a TLS handshake
- backend_tls_handshake_failed metric increments
HostnameMismatch 503 Platform - logs with error x509: certificate is valid for <x> not <y>
- backend_invalid_id metric increments
UntrustedCert 526 Platform - logs with an error prefix x509: certificate signed by unknown authority
- backend_invalid_tls_cert metric increments
RemoteFailedCertCheck 496 Platform logs with error remote error: tls: bad certificate
ContextCancelled 499 Client/App logs with error context canceled

Note: This status code is never returned to clients, only logged, as it occurs when the downstream client closes the connection before Gorouter responds.

RemoteHandshakeFailure 525 Platform - logs with error remote error: tls: handshake failure
- backend_tls_handshake_failed metric increments

*Note: any platform issue could be the result of a misconfiguration

For each of the above errors, there will be a backend-endpoint-failure log line in gorouter.log and an error message in gorouter.err.log. Additionally, the access.log will record the request status codes.

View the source for this page in GitHub