Troubleshooting router error responses in Cloud Foundry

Page last updated:

Are your 502 errors the result of your infrastructure, Cloud Foundry, or an app? This topic helps you understand and debug these errors.

Outer error responses overview

In your deployment, 502 errors can come from any of the following:

If you are unsure of the source of 502 errors, see General Debugging Steps.

General debugging steps

Some general debugging steps for any issue resulting in 502 errors are as follows:

  1. Gather the Gorouter logs and Diego Cell logs at the time of the incident.

  2. Review the logs and consider the following:

    1. Which errors are the Gorouters returning?
    2. Is Gorouter’s routing table accurate? Are the endpoints for the route as expected? For more information, see Dynamic Routing Table in the Gorouter documentation on GitHub.
    3. Do the Diego Cell logs have anything interesting about unexpected app crashes or restarts?
    4. Is the app healthy and handling requests successfully? You can use request tracing headers to verify. For more information, see HTTP Headers for Zipkin Tracing in HTTP Routing.
  3. Consider the following:

    • Does your load balancer log 502 errors, but Gorouter does not? This means that traffic is not reaching Gorouter.
    • Was there a recent platform change or upgrade that caused an increase in 502 errors?
    • Are there any suspicious metrics spiking? How is the CPU and memory utilization?

Log formatting

Levels

The following table describes the log levels supported by Gorouter. The log level is specified in the configuration YAML file for Gorouter.

Message Description Examples
fatal Gorouter is unable to handle any requests due to a fatal error. Gorouter cannot bind to its TCP port, a Cloud Foundry component has published invalid data to Gorouter.
error An unexpected error has occurred. Gorouter failed to fetch token from UAA service.
info An expected event has occurred. Gorouter started or exited, Gorouter has begun to prune routes for stale droplets.
debug A lower-level event has occurred. Route registration, route unregistration.

Message contents

This section provides a sample Gorouter log entry and explanation of the contents.

[2017-02-01 22:54:08+0000] {"log_level":0,"timestamp":1485989648.0895808,"message":"endpoint-registered","source":"vcap.Gorouter.registry","data":{"uri":"0-*.login.bosh-lite.com","backend":"10.123.0.134:8080","modification_tag":{"guid":"","index":0}}}

Property Description
log_level Logging level of the message
timestamp Epoch time of the log
message Content of the log entry
source Gorouter function that initiated the log entry
data Additional information that varies based on the message

Access logs

This section provides details about Gorouter access logs.

Gorouter generates an access log in the following format when it receives a request:

<Request Host> - [<Start Date>] "<Request Method> <Request URL> <Request Protocol>" <Status Code> <Bytes Received> <Bytes Sent> "<Referrer>" "<User-Agent>" <Remote Address> <Backend Address> x_forwarded_for:"<X-Forwarded-For>" x_forwarded_proto:"<X-Forwarded-Proto>" vcap_request_id:<X-Vcap-Request-ID> response_time:<Response Time> gorouter_time:<Gorouter Time> app_id:<Application ID> app_index:<Application Index> x_cf_routererror:<X-Cf-RouterError> <Extra Headers>

Gorouter access logs are also redirected to syslog.

See the following list for more information about the Gorouter access log fields:

  • The following are optional fields: Status Code, Response Time, Application ID, Application Index, X-Cf-RouterError, and Extra Headers.

  • If the access log lacks a Status Code, Response Time, Application ID, Application Index, or X-Cf-RouterError, the corresponding field shows -.

  • Response Time is the total time it takes for the request to go through the Gorouter to the app and for the response to travel back through the Gorouter. This has the time that the request spends traversing the network to the app and back again to the Gorouter. It also has the time the app spends forming a response.

  • Gorouter Time is the total time it takes for the request to go through the Gorouter initially plus the time it takes for the response to travel back through the Gorouter. This does not have the time the request spends traversing the network to the app. This also does not have the time the app spends forming a response.

  • X-Cf-RouterError is populated if the Gorouter encounters an error. The returned values can help distinguish whether a non 2xx response code is due to an error in the Gorouter or the back end. For more information on the possible errors, see the Diagnose App Errors section.

Diagnose Gorouter errors

This section describes the basic structure of Gorouter logs and how to diagnose Gorouter errors.

Gorouter cannot connect to the app container

If Gorouter cannot connect to the app container, you might see this error in the gorouter.log:

[2018-07-05 17:59:10+0000] {"log_level":3,"timestamp":1530813550.92134,"message":
"backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":
{"ApplicationId":"","Addr":"10.0.32.15:60099","Tags":null,"RouteServiceUrl":""},
"error":"dial tcp 10.0.32.15:60099: getsockopt: connection refused"}}

If TCP cannot make an initial connection to the backend, Gorouter retries TCP dial errors up to max_attemps which defaults to 3. If it still fails, Gorouter returns a 502 to the client and writes to the access.log.

Any of the following can cause connection errors between Gorouter and the app container:

  • An app that is unresponsive, indicates an issue with the app.
  • A stale route in Gorouter, indicates an issue with the platform. For more information, see Diagnose Stale Routes.
  • A corrupted app container, indicates a problem with the platform.

Gorouter errors after connecting

If Gorouter successfully dials the endpoint but an error occurs, you might see the following:

  • read: connection reset by peer errors. When these errors occur, the Gorouter returns error to the client, marks backend ineligible, and does not retry another back end.

    If a TCP connection can be established (with Envoy), but the app itself does not send a complete http response, Gorouter retries the request on different app instances if any exist.

    A request is considered idempotent under the following circumstances (see also Gorouter source code):

    Request body is empty AND (Request method is one of “GET”, “HEAD”, “OPTIONS”, “TRACE” OR Request Header X-Idempotency-Key or Idempotency-Key is set)

    The most likely scenario for this to occur is during creation of a new app instance, while the app is not yet completely started, but Envoy is already accepting connections.

  • TLS handshake errors. When these errors occur, the Gorouter retries up to max_attemps which defaults to 3. If it still fails, Gorouter can return a 502. These errors appear similar to the following in the gorouter.log, and a 502 error is logged in the access.log:

    [2018-07-05 18:20:54+0000] {"log_level":3,"timestamp":1530814854.4359834,"message":"
    backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":
    {"ApplicationId":"","Addr":"10.0.16.17:61002","Tags":null,"RouteServiceUrl":""},
    "error":"x509:certificate is valid for 53079ca3-c4fe-4910-78b9-c1a6, not xxx"}}
    

  • If a clients cancels a request before the server responds with headers, Gorouter returns a 499 error.

Diagnose stale routes in Gorouter

A stale route occurs when Gorouter contains out-of-date route information for a backend app. In nearly all cases, stale routes are self correcting.

If TLS from Gorouter to apps and other backends is enabled, then when Gorouter detects that it is sending traffic to the wrong app, it prunes that backend app from its route table and terminates the connection. TLS from Gorouter to apps and other backends is enabled by default in cf-deployment v7.0.0.

Causes of stale routes

When a route is unmapped or when an app container is deleted because the app is deleted or moved, a deregister message is sent to Gorouter. This message tells Gorouter to delete the route mapping to that container.

If Gorouter does not receive this deregister message, the route is now considered stale. Gorouter still attempts to send traffic to the app.

You are more likely to have stale routes when the following are true:

  • SSL verification is not activated.
  • You do not have Diego Release v2.34.0 or later deployed, which contains a fix that sends unregistration messages multiple times.
  • You unmapped a route to an app, but traffic to that route is still being sent to the app.

How to locate stale routes

The following procedure helps you identify stale routes:

  1. Verify the state of the deployment. Run cf routes for all spaces and ensure the route is only mapped to the intended apps. Sometimes there can be multiple routes using the same hostname and domain, but with different paths. If the domain is shared, check all orgs as well.
  2. Examine the Gorouter routes table. It might be necessary to check multiple Gorouters, as it is possible that some received the proper deregister message and some did not.
    1. SSH to the VM where Gorouter is running.
    2. To print the entire Gorouter routes table, run: /var/vcap/jobs/gorouter/bin/retrieve-local-routes | jq .
    3. Find the entry for the suspected stale route. Note the values for address and private_instance_id.
  3. Cross-reference the Gorouter routes table entry with actual Long-Running Processes (LRPs):
    1. SSH onto the Diego Cell where the IP address matches the IP address that you found on the routes table entry.
    2. To get information about all of the actual LRPs, run: cfdot actual-lrps | jq .'
    3. Look through the actual LRPs to find the instance ID that you noted from the routes table. If that instance ID exists and the port in the route table does not exist in the ports section, then there is likely a stale route.

Important The CAPI endpoint GET /v3/processes/:guid/stats provides information about the host and ports the app is using, and is intended for use by app developers. This endpoint does not provide complete information for operators. Operators should use the cfdot CLI on the Diego Cell to view the LRPs directly and all at once.

How to fix stale routes

In most cases, you do not have to fix stale routes. Stale routes are normal in the routing table.

Gorouter fixes stale routes on its own when SSL verification is activated. When Gorouter attempts to send traffic to a stale endpoint, it recognizes that the endpoint is stale. Gorouter prunes the stale route and then tries another endpoint. Stale routes are only a problem if an end user receives errors.

The following procedure helps you fix stale routes:

  1. Ensure that SSL verification is activated. For more information, see With TLS Enabled in HTTP Routing.

    Important Using TLS to verify app identity depends on SSL verification. If you deactivate SSL verification, there is no way to avoid misrouting.

  2. If there is a stale route, then restarting Gorouter fixes the immediate issue. If you restart all of the Gorouters, and see the same issue for the exact same route, then the issue is not a stale route. The issue is that a bad route is continuing to be added to the route table no matter how many time Gorouter prunes it.
  3. If Gorouter is continually missing deregister messages, it might be because either the NATS message bus or the Gorouters are overwhelmed. Look at the VM usage and consider scaling.

Gorouter error classification table

Use this table when you are debugging Gorouter errors. The table lists error types, status codes, and indicates if the Gorouter retries the errored request.

For each error, a backend-endpoint-failure log entry appears in gorouter.log and an error message in gorouter.err.log. Additionally, the access.log records the request status codes. For more information about debugging Gorouter errors, see the Gorouter documentation on GitHub.

If the request is one that can be retried, Gorouter makes up to max_attemps which defaults to 3.

Error Type Status Code Can be re‑tried? Source of Issue Evidence
Dial 502 Yes App or Platform Logs with error dial tcp
AttemptedTLSWith
NonTLSBackend
525 Yes Platform Logs with error tls: first record does not look like a TLS handshake or backend_tls_handshake_failed metric increments
HostnameMismatch 503 Yes Platform Logs with error x509: certificate is valid for < x >, not < y >
or backend_invalid_id metric increments
UntrustedCert 526 Yes Platform Logs with error prefix x509: certificate signed by unknown authority or backend_invalid_tls_cert metric increments
RemoteFailedCertCheck 496 Yes Platform Logs with error remote error: tls: bad certificate
ContextCancelled 499 No Client/App Logs with error context canceled
This status code appears in logs only. It is never returned to clients as it occurs when the downstream client closes the connection before Gorouter responds.
RemoteHandshakeFailure 525 Yes Platform Logs with error remote error: tls: handshake failure and backend_tls_handshake_failed metric increments
ExpiredOrNotYetValidCertFailure 502 Yes Platform Logs with error x509: certificate has expired or is not yet valid. For example, this error can occur if the Diego Cell clock drifts.
unknown 502 No App or Platform If the error is not one of the types mentioned earlier, then 502 is the default response. Gorouter tracks 502 errors with the gorouter.bad_gateways metric. See Router Error: 502 Bad Gateway.

Diagnose app errors

This section describes app-related 502 errors.

When 502 errors only occur in specific app instances and not all app instances on the platform, it is likely an app-related error. The app might be overloaded, unresponsive, or unable to connect to the database.

When all apps are experiencing 502 errors, then it could either be a platform issue. For example, a misconfiguration, or an app issue likely all apps being unable to connect to an upstream database.

If the Gorouter has back end keep-alive connections activated, 502 errors can occur due to a race condition when the app instance keep alive idle timeout is less than 90 seconds. For more information, see Gorouter Back End Keep-Alive Connections.

For any other errors that appear in Gorouter access logs, you can analyze application logs if they need further investigation.

Gorouter does not retry any error response returned by the app.

Gorouter specific response headers

In the case that Gorouter encounters an error connecting to an application backend, the X-CF-RouterError header is populated to help distinguish the origin of a non-2xx response code.

The value of the X-Cf-Routererror header can be one of the following:

Value Description
invalidcfappinstanceheader The provided value for the X-Cf-App-Instance header does not match the required format of APP_GUID:INSTANCE_ID.
empty_host The value for the Host header is empty, or the Host header is equivalent to the remote address. Some LB’s optimistically set the Host header value with their IP address when there is no value present.
unknown_route The desired route does not exist in the gorouter’s route table.
no_endpoints The entry is in the route table for the desired route, but there are no healthy endpoints available.
Connection Limit Reached The backends associated with the route have reached their max number of connections. The max connection number is set via the spec property router.backends.max_conns.
routeserviceunsupported Route services are not enabled or WebSockets requests are bound to route services. You can configure route services using the spec property router.route_services_secret. If the property is empty, route services are deactivated.
endpoint_failure The registered endpoint for the desired route failed to handle the request.
View the source for this page in GitHub