us.42crunch.cloud is down

Incident Report for 42Crunch

Postmortem

November 23rd, 2021

We want to provide you with some additional information about the service disruption that occurred in the 42Crunch Enterprise platform (us.42crunch.cloud) on November 1st, 2021.

Issue Summary

On Monday, November 1, 2021, 42Crunch SaaS platform (us.42crunch.cloud) instances in all regions lost connectivity for approximately 90 minutes in total, from 21:20 to 22:50 Central European Time.

At the time of the outage, we noticed many "Back-off restarting failed container" messages for the kube-proxy pods.

Because of these kube-proxy errors, we began to see many "connection refused" errors in GKE components.

Error messages like: dial tcp 10.104.0.1:443: connect: connection refused

For example, kubedns, autoscaler, metrics-server, event-exporter and others could not connect to the default Kubernetes service.
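A "connection refused" error means the TCP connection attempt was actively rejected, which is different from a timeout where no answer arrives at all. The following is a minimal sketch of how that distinction could be checked from a node or pod. The function name probe_tcp is hypothetical and is not part of the tooling used during the incident.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify one TCP connection attempt to host:port.

    Hypothetical helper for illustration only.
    """
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or something in the path) actively rejected the connection.
        return "refused"
    except socket.timeout:
        # No answer arrived within the timeout.
        return "timeout"
    except OSError as exc:
        # Any other network-level failure (for example, no route to host).
        return f"error: {exc}"
```

Calling probe_tcp("10.104.0.1", 443) with the address from the error message above would be one way to test reachability of the default Kubernetes service from inside the cluster; the result depends on where the call is made.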

Our pods were up, but because kube-proxy was down, it could not forward traffic to them.

On our side, during the outage we observed that:

all pods were up and running
there were no restarts (of pods or nodes)
there were no application errors in the logs
our app did not return any 50x HTTP error codes (our stack was fully up the whole time)

At the node level, we saw errors indicating that connections to our control plane were being refused.

We opened a case with the Google support team, in which we described the incident and requested more information on why the platform was unavailable.

The control plane became unavailable at 12:20 PM PT and became available again at 1:56 PM PT on November 1st. These times line up with a GKE outage that affected clusters in us-west2.

Google Support informed us that the problem was on their side, in GSLB (Google Global Software Load Balancer). GSLB allows Google to balance live user traffic between clusters so that it can match user demand to available service capacity and handle service failures in a way that is transparent to users.

The GSLB used by the hosted master service on GKE (Google Kubernetes Engine) was affected by a network configuration change made by Google, which broke connectivity to hosted masters in the us-west2 region (where us.42crunch.cloud is deployed). The change was pushed minutes before the outage, and end-user impact began at 12:20 PM PT. Customers' services were not targeted; however, their masters run in a GKE-owned project that was affected, so the hosted masters temporarily lost network connectivity.

The outage caused traffic loss to the control plane. While the control plane was unavailable, tools and functions that depend on it, such as gcloud, kubectl, and IAM service account authentication, were unavailable, as were services such as master repairs and autoscaling. Existing workloads are believed to have been unaffected.

Nodes were showing failures to connect to the control plane VIP (virtual IP address), which is another symptom of the outage.

Fortunately, the GKE team rolled back the configuration change quickly, and at 1:54 PM PT the issue was deemed mitigated. This matches the time our control plane regained availability, and it has remained healthy since.

On Google's side, this outage led to many high-priority action items for the product team, including improving alerting on traffic drops to the hosted master service and implementing better testing to predict how such changes will roll out.

Unfortunately, our cluster was affected, and there was nothing we could have done to avoid this. The GKE team was able to identify the bad configuration push, roll back the change, and mitigate the issue within 2 hours.

In closing

We want to apologize for the impact this event caused for our customers. While we are proud of our track record of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.

Posted Dec 29, 2021 - 18:57 UTC

Resolved

This incident has been resolved

Posted Nov 01, 2021 - 21:45 UTC

Monitoring

We are still monitoring the outage and in contact with Google Cloud support to fix the issue

Posted Nov 01, 2021 - 21:00 UTC

Identified

We've identified that the problem is on GKE (Google Kubernetes Engine) where our platform is hosted

Posted Nov 01, 2021 - 20:45 UTC

Investigating

We are currently investigating an outage with our Enterprise platform (us.42crunch.cloud)

Posted Nov 01, 2021 - 20:15 UTC

This incident affected: 42Crunch Enterprise Platform (Enterprise Platform).