Troubleshooting - General
Identify the causes and fixes for general problems with Service Mesh. The following general troubleshooting solutions are available.
Changes Made to Mesh Resources in the OCI Console or OCI CLI Revert to their Previous State
Issue
Any changes made to mesh resources (for example: ingress gateway, virtual service, virtual deployment, and so on) from the OCI Console or the OCI CLI revert to their previous state based on the update interval set for the operator.
Explanation
Currently, after the initial creation of a mesh resource, changes to the resource can be made only through the OCI Service Operator for Kubernetes, and therefore through kubectl. At each operator update interval (for example, every hour), the OCI Service Operator for Kubernetes runs a reconciliation process. Any resources in the Service Mesh control plane whose settings differ are reverted to match the settings stored in the OCI Service Operator for Kubernetes.
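To make a change that persists across reconciliation, update the Kubernetes custom resource itself rather than the control plane. A minimal sketch, assuming a hypothetical virtual service named pets in the bookinfo namespace:

```sh
# Edit the custom resource; the operator reconciles the control plane to match it.
kubectl edit virtualservices.servicemesh.oci.oracle.com pets -n bookinfo

# Or apply an updated manifest kept in source control.
kubectl apply -f pets-virtual-service.yaml
```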
Having General Traffic Issues with Service Mesh
Issue
Missing or unexpected service traffic is usually a sign of improper routing settings. The most common causes fall into two categories: general routing misconfiguration and SSL-related problems.
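A quick first check is to confirm that the affected pods actually have the mesh proxy injected. A sketch, assuming sidecar injection is expected in the namespace:

```sh
# List each pod with its containers; meshed pods should include oci-sm-proxy.
kubectl get pods -n NAMESPACE_NAME \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
```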
Troubleshooting Ingress Gateway Deployments
Issue
The IngressGatewayDeployment resource creates dependent resources such as a Deployment, a Service, and a Horizontal Pod Autoscaler. The Service created by the IngressGatewayDeployment can in turn create a LoadBalancer resource. If any of these dependent resources fails to be created, the IngressGatewayDeployment resource doesn't become active. To remediate some common issues, review the following solution.
Solution
If the deployment produces an error similar to the following, the Service of type LoadBalancer created by the IngressGatewayDeployment failed to create a public load balancer in a private subnet.
Warning SyncLoadBalancerFailed 3m2s (x10 over 48m) service-controller (combined from similar events): Error syncing load balancer: failed to ensure load balancer: creating load balancer: Service error:InvalidParameter. Private subnet with id <subnet-ocid> is not allowed in a public loadbalancer.. http status code: 400. Opc request id: <opc-request-id>
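This event is recorded on the Service object; one way to surface it, assuming the bookinfo namespace used in the examples that follow:

```sh
# List recent warning events in the namespace, including load balancer sync failures.
kubectl get events -n bookinfo --field-selector type=Warning
```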
To use a private or internal load balancer, do the following:
- Remove the service section from the IngressGatewayDeployment resource.
- Create a Service with the correct annotations that points to the ingress gateway pods.
Your updated resources look similar to the following examples.
IngressGatewayDeployment without the service section:

```yaml
apiVersion: servicemesh.oci.oracle.com/v1beta1
kind: IngressGatewayDeployment
metadata:
  name: bookinfo-ingress-gateway-deployment
  namespace: bookinfo
spec:
  ingressGateway:
    ref:
      name: bookinfo-ingress-gateway
  deployment:
    autoscaling:
      minPods: 1
      maxPods: 1
  ports:
    - protocol: TCP
      port: 9080
      serviceport: 80
```
Service to create an internal load balancer:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: bookinfo-ingress
  namespace: bookinfo
  annotations:
    service.beta.kubernetes.io/oci-load-balancer-internal: "true"
spec:
  ports:
    - port: 80
      targetPort: 9080
      name: http
  selector:
    servicemesh.oci.oracle.com/ingress-gateway-deployment: bookinfo-ingress-gateway-deployment
  type: LoadBalancer
```
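Apply both manifests, then confirm that the load balancer is provisioned with a private address. A sketch of the check, assuming the resource names above (the manifest file names are hypothetical):

```sh
kubectl apply -f ingress-gateway-deployment.yaml
kubectl apply -f ingress-service.yaml

# EXTERNAL-IP should be an address from the private subnet once provisioning completes.
kubectl get service bookinfo-ingress -n bookinfo
```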
Horizontal Pod Autoscaler (HPA) Does Not Scrape Metrics
Issue
The Horizontal Pod Autoscaler (HPA) does not scrape metrics.
Solution
When an application pod is set up with Service Mesh, the Service Mesh proxy container is injected into the pod. Along with the proxy container, an init container is injected to perform the one-time initialization that the proxy requires. Because of the init container's presence in the pod, the metrics-server is unable to scrape metrics from the pod in some scenarios. Refer to the following table.
| metrics-server Version | HPA API Version | Able to Scrape Metrics |
|---|---|---|
| v0.6.x | autoscaling/v2beta2 | No |
| v0.6.x | autoscaling/v1 | Yes |
| v0.4.x | Any | No |
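Given the table, a workaround when running metrics-server v0.6.x is to define the autoscaler against the autoscaling/v1 API. A minimal sketch, assuming a hypothetical meshed Deployment named pets-v1:

```yaml
apiVersion: autoscaling/v1          # the combination the table marks as working with v0.6.x
kind: HorizontalPodAutoscaler
metadata:
  name: pets-v1-hpa                 # hypothetical name
  namespace: NAMESPACE_NAME
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pets-v1                   # hypothetical meshed workload
  minReplicas: 1
  maxReplicas: 3
  targetCPUUtilizationPercentage: 80
```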
Virtual Deployment Pods Receive No Traffic
Issue
My virtual deployment pods receive no traffic.
Solution
By default, the routing policy for a virtual service is DENY. Therefore, do one of the following:
- Change the routing policy to UNIFORM (see the sketch after this list).
- Create a virtual service route table to route traffic to your virtual deployment.
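A sketch of the first option, showing the relevant defaultRoutingPolicy field on the VirtualService resource; the pets and bookinfo names and the compartment OCID are placeholders:

```yaml
apiVersion: servicemesh.oci.oracle.com/v1beta1
kind: VirtualService
metadata:
  name: pets
  namespace: bookinfo
spec:
  mesh:
    ref:
      name: MESH_NAME
  compartmentId: <compartment-ocid>
  defaultRoutingPolicy:
    type: UNIFORM   # distribute traffic uniformly across the service's virtual deployments
  hosts:
    - pets
```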
Troubleshoot Traffic Issues with Proxy config_dump
Issue
You're experiencing one of the following traffic issues.
- A service isn't receiving any traffic.
- Secure communication isn't happening between services.
- Traffic splitting isn't happening across versions.
- A/B testing or canary deployments fail.
Solution
To troubleshoot the issue, get the config_dump file for the pod with the issue. You can often infer more by comparing the config_dump files of the source and destination pods. To get the file, perform the following steps.
- Open a shell in the oci-sm-proxy container.
  $ kubectl exec -i -t -n NAMESPACE_NAME POD_NAME -c oci-sm-proxy -- /bin/bash
- Inside the container, fetch the config_dump file from the Envoy admin endpoint.
  $ curl localhost:9901/config_dump
- Exit the container.
  $ exit
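To capture the dump without an interactive shell, you can run curl through kubectl exec and inspect the result locally. A sketch, assuming jq is installed on your workstation:

```sh
# Save the Envoy config dump from the proxy container to a local file.
kubectl exec -n NAMESPACE_NAME POD_NAME -c oci-sm-proxy -- \
  curl -s localhost:9901/config_dump > config_dump.json

# For example, pull out the clusters section to check upstream endpoints.
jq '.configs[] | select(."@type" | contains("ClustersConfigDump"))' config_dump.json
```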
Analyze Traffic Between Service Versions with Prometheus
Issue
You need to identify whether traffic is being sent to a particular version of a service, for example over the last five minutes. Prometheus metrics provide this information.
Solution
To view service traffic in the last five minutes, perform the following steps. The following examples assume a service named "pets" is deployed and has multiple versions.
- Open the Prometheus dashboard in the browser using port-forwarding.
  kubectl port-forward PROMETHEUS_POD_NAME -n PROMETHEUS_NAMESPACE PROMETHEUS_CONTAINER_PORT:9090
- To view Prometheus metrics, visit http://localhost:9090/graph in the browser.
- To view the total count of all requests sent to the pets-v1 service, enter the following query in the Prometheus search field and press Execute.
  envoy_cluster_external_upstream_rq_completed{virtual_deployment_name="pets-v1"}
- To view the rate of requests to the pets-v1 service over the past five minutes, enter the following query in the Prometheus search field and press Execute.
  rate(envoy_cluster_external_upstream_rq_completed{virtual_deployment_name="pets-v1"}[5m])
- To fetch the total number of requests sent to all pods, query the Prometheus HTTP API with curl.
  curl 'localhost:9090/api/v1/query?query=envoy_cluster_external_upstream_rq_completed' | jq
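To compare how traffic splits across versions at once, aggregate the per-version request rates. A sketch of the query, assuming the version names share the pets- prefix:

```
# Per-version request rate over the last five minutes.
sum by (virtual_deployment_name) (
  rate(envoy_cluster_external_upstream_rq_completed{virtual_deployment_name=~"pets-.*"}[5m])
)
```

If traffic splitting is configured (for example, 90/10 between pets-v1 and pets-v2), the resulting series should reflect approximately that ratio.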