AWS Open Distro for OpenTelemetry

Prometheus Remote Write Exporter Advanced Configurations for AMP

In this guide, we provide some advanced configurations for the AWS Distro for OpenTelemetry (ADOT) Collector-AWS Managed Service for Prometheus (AMP) Pipeline.

For an overview of the pipeline or for more basic configurations, please take a look at the Getting Started with the AWS Distro for OpenTelemetry Collector-AMP Pipeline in EKS guide.




Prometheus Receiver Configurations

The Prometheus Receiver provides many configurations to perform service discovery, metric scraping, and metric re-labelling.

Note that each of these configurations requires its own Role-Based Access Control (RBAC) permissions to access the Kubernetes API and discover scrape targets. These requirements can be found here.

Additional Kubernetes/EKS Scraping Configurations

To monitor your Kubernetes applications and clusters, we specifically use kubernetes_sd_configs. We can choose among various Kubernetes objects to discover and scrape, including endpoints, pods, nodes, services, and ingresses. For each of these objects, we provide a default configuration.
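Each role follows the same shape. As a minimal sketch (the job name and namespace below are illustrative assumptions, not part of the default configurations), discovery can also be restricted to specific namespaces:

```yaml
scrape_configs:
  - job_name: 'example-pods'
    kubernetes_sd_configs:
      - role: pod          # one of: endpoints, pod, node, service, ingress
        namespaces:
          names:
            - my-app-namespace   # hypothetical namespace to limit discovery
```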

Endpoints

The Prometheus Receiver monitors each application's deployment using the service endpoints. Specifically, it scrapes and collects metrics from the /metrics endpoint. In order to create and expose these metrics, we use the Prometheus client libraries.

- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
    - role: endpoints

  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  relabel_configs:
    # Example relabel to scrape only endpoints that have
    # "prometheus.io/scrape = true" annotation.
    # - action: keep
    #   regex: true
    #   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    # Example relabel to configure scrape scheme for all service scrape targets
    # based on endpoints "prometheus.io/scrape_scheme = <scheme>" annotation.
    # - action: replace
    #   regex: (https?)
    #   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    #   target_label: __scheme__
    # Example relabel to customize metric path based on endpoints
    # "prometheus.io/metric_path = <metric path>" annotation.
    # - action: replace
    #   regex: (.+)
    #   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    #   target_label: __metrics_path__
    # Example relabel to scrape only single, desired port for the service based
    # on endpoints "prometheus.io/scrape_port = <port>" annotation.
    # - action: replace
    #   regex: ([^:]+)(?::\d+)?;(\d+)
    #   replacement: $$1:$$2
    #   source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    #   target_label: __address__
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - action: replace
      source_labels: [__meta_kubernetes_namespace]
      target_label: Namespace
    - action: replace
      source_labels: [__meta_kubernetes_service_name]
      target_label: Service
    - action: replace
      source_labels: [__meta_kubernetes_pod_node_name]
      target_label: kubernetes_node
    - action: replace
      source_labels: [__meta_kubernetes_pod_name]
      target_label: pod_name
    - action: replace
      source_labels: [__meta_kubernetes_pod_container_name]
      target_label: container_name

  # Exclude high cardinality metrics
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'go_gc_duration_seconds.*'
      action: drop

Pods

A pod is a group of one or more containers, with shared storage/network resources, and a specification for how to run the containers. When monitoring pods, we want to watch the pod deployment patterns, total pod instances, and expected vs. actual pod instances.

- job_name: 'kubernetes-pods'
  sample_limit: 10000
  kubernetes_sd_configs:
    - role: pod
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
    # Example relabel to scrape only endpoints that have
    # "prometheus.io/scrape = true" annotation.
    # - action: keep
    #   regex: true
    #   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    # Example relabel to configure scrape scheme for all service scrape targets
    # based on endpoints "prometheus.io/scrape_scheme = <scheme>" annotation.
    # - action: replace
    #   regex: (https?)
    #   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    #   target_label: __scheme__
    # Example relabel to customize metric path based on endpoints
    # "prometheus.io/metric_path = <metric path>" annotation.
    # - action: replace
    #   regex: (.+)
    #   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    #   target_label: __metrics_path__
    # Example relabel to scrape only single, desired port for the service based
    # on endpoints "prometheus.io/scrape_port = <port>" annotation.
    # - action: labelmap
    #   regex: __meta_kubernetes_pod_label_(.+)
    - action: replace
      source_labels: [__meta_kubernetes_namespace]
      target_label: Namespace
    - action: replace
      source_labels: [__meta_kubernetes_pod_name]
      target_label: pod_name
    - action: replace
      source_labels: [__meta_kubernetes_pod_container_name]
      target_label: container_name
    - action: replace
      source_labels: [__meta_kubernetes_pod_controller_name]
      target_label: pod_controller_name
    - action: replace
      source_labels: [__meta_kubernetes_pod_controller_kind]
      target_label: pod_controller_kind
    - action: replace
      source_labels: [__meta_kubernetes_pod_phase]
      target_label: pod_phase

  metric_relabel_configs:
    - action: drop
      source_labels: [__name__]
      regex: 'go_gc_duration_seconds.*'

Kubernetes (k8s) API Server

The kube-apiserver provides REST operations and the front-end to the cluster’s shared state through which all other components interact. Key metrics to watch for include: the number and duration of requests for each combination of resource (including pods, Deployments, etc.) as well as the operation (such as GET, LIST, POST, DELETE).

The TLS configuration, which uses the in-cluster service account's CA certificate and bearer token, gives us authenticated access to the k8s objects.

- job_name: 'kubernetes-apiservers'
  sample_limit: 10000
  # Default to scraping over https. If required, just disable this or change to
  # `http`.
  scheme: https

  kubernetes_sd_configs:
    - role: endpoints
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  relabel_configs:
    - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: kubernetes;https

Some types of metrics we can receive from the API server include:

Metric | Description | Metric Type
apiserver_request_duration_seconds | Count of requests to the API server for a specific resource and verb | Work: Throughput
workqueue_queue_duration_seconds | Total number of seconds that items spent waiting in a specific work queue | Work: Performance
workqueue_work_duration_seconds | Total number of seconds spent processing items in a specific work queue | Work: Performance

cAdvisor

cAdvisor is an agent integrated into the kubelet binary that monitors resource usage and analyzes the performance of containers. Key metrics collected by cAdvisor include the CPU, memory, file, and network usage for containers running on a given node.

- job_name: 'kubernetes-cadvisor'
  sample_limit: 10000
  # Default to scraping over https. If required, just disable this or change to
  # `http`.
  scheme: https
  metrics_path: /metrics/cadvisor

  kubernetes_sd_configs:
    - role: node
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)

Types of metrics we can receive from the cAdvisor include:

Metric | Description
container_cpu_load_average_10s | Value of container CPU load average over the last 10 seconds
container_cpu_system_seconds_total | Cumulative system CPU time consumed, in seconds
container_last_seen | Last time a container was seen by the exporter
container_memory_failcnt | Number of times memory usage hit limits
container_memory_failures_total | Cumulative count of memory allocation failures

Nodes

Kubernetes nodes are the virtual or physical machines that run our workloads. Key metrics to watch for nodes mainly report on resource utilization including allocatable memory/CPU and CPU/disk utilization.

- job_name: 'kubernetes-nodes'
  sample_limit: 10000
  # Default to scraping over https. If required, just disable this or change to
  # `http`.
  scheme: https

  kubernetes_sd_configs:
    - role: node
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)

Types of metrics we receive from the nodes (reported by the kubelet) include:

Metric | Description
kubelet_cgroup_manager_duration_seconds | Duration in seconds for cgroup manager operations, broken down by method
kubelet_node_config_error | This metric is true (1) if the node is experiencing a configuration-related error, false (0) otherwise
kubelet_pleg_relist_duration_seconds | Duration in seconds for relisting pods in PLEG (pod lifecycle event generator)
kubelet_pod_start_duration_seconds | Duration in seconds for a single pod to go from pending to running
kubelet_pod_worker_duration_seconds | Duration in seconds to sync a single pod, broken down by operation type: create, update, or sync
kubelet_running_pod_count | Number of pods currently running
kubelet_runtime_operations_duration_seconds | Duration in seconds of runtime operations, broken down by operation type
kubelet_runtime_operations_errors_total | Cumulative number of runtime operation errors by operation type; a good indicator of low-level issues in the node, such as problems with the container runtime
kubelet_runtime_operations_total | Total count of runtime operations of each type
storage_operation_duration_seconds_count | Storage operation duration
storage_operation_errors_total | Storage operation errors

Services

A service is an abstract way to expose an application running on a set of pods as a network service. Services define a logical set of pods and a policy by which to access them, enabling decoupling between pods. The important metric to consider for services is the health of the service itself. We can use the Blackbox Exporter provided by Prometheus to probe endpoints over HTTP, HTTPS, DNS, TCP, and ICMP.

Prerequisite

To use the Blackbox Exporter, we can deploy the configurations here.

Using the name of the Blackbox Exporter service and its exposed port (9115), we can access the probed metrics. This is done by rewriting the __address__ label to point at the exporter. If we do not want to probe all services, we can instead specify a list of target addresses in static_configs (more information can be found here).
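If only a fixed set of endpoints needs probing, a static_configs job can replace service discovery entirely. The following is a minimal sketch; the job name and target URL are illustrative assumptions:

```yaml
- job_name: 'blackbox-static-targets'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    # Hypothetical targets; replace with the endpoints you want probed.
    - targets:
        - http://example-service.default.svc:8080
  relabel_configs:
    # Pass the original target to the exporter as the ?target= parameter ...
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    # ... and send the actual scrape to the Blackbox Exporter itself.
    - target_label: __address__
      replacement: blackbox-exporter-service:9115
```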

- job_name: 'kubernetes-services'
  sample_limit: 10000
  metrics_path: /probe
  params:
    module: [http_2xx]

  kubernetes_sd_configs:
    - role: service
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  relabel_configs:
    # Example relabel to probe only some services that have "prometheus.io/should_be_probed = true" annotation
    # - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_should_be_probed]
    #   action: keep
    #   regex: true
    - source_labels: [__address__]
      target_label: __param_target
    - target_label: __address__
      replacement: blackbox-exporter-service:9115
    - source_labels: [__param_target]
      target_label: instance
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      target_label: kubernetes_name

Some types of metrics we can receive from the service probing include:

Metric | Description
probe_duration_seconds | How long the probe took to complete, in seconds
probe_http_status_code | Response HTTP status code
probe_http_uncompressed_body_length | Length of the uncompressed response body
probe_success | Whether or not the probe was a success

Ingresses

A Kubernetes ingress is an API object that manages external access to the services in a cluster. Similar to services, the important metric to consider is the health of the ingress. We can use the Blackbox Exporter provided by Prometheus to probe endpoints over HTTP, HTTPS, DNS, TCP, and ICMP.

As with the services setup above, this also requires the Blackbox Exporter.

- job_name: 'kubernetes-ingresses'
  sample_limit: 10000
  metrics_path: /probe
  params:
    module: [http_2xx]

  kubernetes_sd_configs:
    - role: ingress
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  relabel_configs:
    # Example relabel to probe only some services that have "prometheus.io/should_be_probed = true" annotation
    # - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_should_be_probed]
    #   action: keep
    #   regex: true
    - source_labels: [__meta_kubernetes_ingress_scheme, __address__, __meta_kubernetes_ingress_path]
      regex: (.+);(.+);(.+)
      replacement: $${1}://$${2}$${3}
      target_label: __param_target
    - target_label: __address__
      replacement: blackbox-exporter-service:9115
    - source_labels: [__param_target]
      target_label: instance
    - action: labelmap
      regex: __meta_kubernetes_ingress_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_ingress_name]
      target_label: kubernetes_name

Some types of metrics that we can receive from the service probing include:

Metric | Description
probe_duration_seconds | How long the probe took to complete, in seconds
probe_http_status_code | Response HTTP status code
probe_http_uncompressed_body_length | Length of the uncompressed response body
probe_success | Whether or not the probe was a success

Notice that these metrics are similar to the service metrics above (as they are both probing metrics). The difference will lie in the labels.




Permissions

Kubernetes API Server

If you are scraping for kube-system components and the API server, the endpoints need to be enabled for private access. More information on this can be found here.

RBAC for Other Kubernetes Resources

In order for service discovery and scraping to work, the ADOT Collector pod may need permissions to get and list objects of the EKS cluster. By default, the OTel Collector uses the default service account to communicate with the API server. We can set a ClusterRoleBinding for this service account such that it can access and scrape the necessary metric endpoints.

Example configuration:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otelcol-rbac
subjects:
  - kind: ServiceAccount
    name: default
    namespace: otelcol
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---

If necessary, we can restrict the service account access to the specific Kubernetes resources we want to scrape. For instance, if we were specifically scraping for pod-level metrics, we could use the following ClusterRole and ClusterRoleBinding:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prom-admin
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list"]
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prom-rbac
subjects:
  - kind: ServiceAccount
    name: default
    namespace: otelcol
roleRef:
  kind: ClusterRole
  name: prom-admin
  apiGroup: rbac.authorization.k8s.io
---



AWS Prometheus Remote Write Exporter Configurations

For the AWS Prometheus Remote Write Exporter to sign your HTTP requests with AWS Signature Version 4 (SigV4, AWS's protocol for secure authentication), you will need to provide the aws_auth configuration. If aws_auth is not provided, HTTPS requests will not be signed.

exporters:
  awsprometheusremotewrite:
    endpoint: "https://aws-managed-prometheus-endpoint/v1/api/remote_write"
    aws_auth:
      service: "aps"
      region: "user-region"

Aside from the auth configurations, the AWS Prometheus Remote Write Exporter is also configurable with retry, sending queue, and timeout settings. An example of these configurations is provided below.

exporters:
  awsprometheusremotewrite:
    endpoint: "https://aps-workspaces-gamma.us-east-1.amazonaws.com/workspaces/ws-7cd45747-2381-4a2a-847f-fa61a3694a74/api/v1/remote_write"
    namespace: test
    aws_auth:
      region: "us-east-1"
      service: "aps"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 10s
      max_elapsed_time: 30s
    timeout: 15s

More information on the possible retry, sending queue, and timeout configurations can be found here.
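For instance, the sending queue, which buffers metrics in memory while the backend is slow or unavailable, can be tuned alongside the retry settings. A minimal sketch; the queue values below are illustrative assumptions, not recommended defaults:

```yaml
exporters:
  awsprometheusremotewrite:
    endpoint: "https://aws-managed-prometheus-endpoint/v1/api/remote_write"
    aws_auth:
      region: "us-east-1"
      service: "aps"
    sending_queue:
      enabled: true
      num_consumers: 10   # parallel senders draining the queue
      queue_size: 5000    # max batches held in memory before data is dropped
```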




Takeaway

These advanced configurations should enable you to monitor your applications and Kubernetes cluster in a secure and reliable manner. If you would like a more basic setup, please take a look at the Getting Started with the AWS Distro for OpenTelemetry Collector-AMP Pipeline in EKS guide.

We would love to hear from you about more common configuration scenarios or improvements to this documentation! Please submit an issue on the aws-otel community to let us know.