AWS Distro for OpenTelemetry
Prometheus Remote Write Exporter Advanced Configurations for AMP
In this guide, we provide some advanced configurations for the AWS Distro for OpenTelemetry (ADOT) Collector-AWS Managed Service for Prometheus (AMP) Pipeline.
For an overview of the pipeline or for more basic configurations, please take a look at the Getting Started with the AWS Distro for OpenTelemetry Collector-AMP Pipeline in EKS guide.
Prometheus Receiver Configurations
The Prometheus Receiver provides many configurations to perform service discovery, metric scraping, and metric re-labelling.
Note that each of these configurations requires its own Role-Based Access Control (RBAC) permissions in order to access the kube-api and discover scrape targets. These requirements can be found here.
Additional Kubernetes/EKS Scraping Configurations
To monitor your Kubernetes applications and clusters, we specifically use the `kubernetes_sd_configs`. We can choose between various Kubernetes objects to discover and scrape, including endpoints, pods, nodes, services, and ingresses. For each of these objects, we provide a default configuration.
Endpoints
The Prometheus Receiver monitors each application's deployment using the service endpoints. Specifically, it scrapes and collects metrics from the `/metrics` endpoint. To create and expose these metrics, we use the Prometheus client libraries.
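To illustrate what the receiver actually scrapes, here is a hypothetical sketch of a `/metrics` endpoint built with only the Python standard library. Real applications should use an official Prometheus client library; the metric name `app_requests_total` and port `8080` are made up for this example, but the text exposition format shown is what Prometheus expects.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # a trivial counter that our handler increments


def render_metrics(count: int) -> str:
    # HELP and TYPE comment lines followed by one sample, in the
    # Prometheus text exposition format.
    return (
        "# HELP app_requests_total Total HTTP requests served.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {count}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUEST_COUNT
        REQUEST_COUNT += 1
        if self.path == "/metrics":
            body = render_metrics(REQUEST_COUNT).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


# To serve it, you would run (blocks forever, so not executed here):
# HTTPServer(("", 8080), MetricsHandler).serve_forever()
```

A client library additionally handles label escaping, metric registries, and histogram/summary types, which is why the hand-rolled version above is only for illustration.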
```yaml
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints

  relabel_configs:
  # Example relabel to scrape only endpoints that have
  # "prometheus.io/scrape = true" annotation.
  # - action: keep
  #   regex: true
  #   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
  # Example relabel to configure scrape scheme for all service scrape targets
  # based on endpoints "prometheus.io/scrape_scheme = <scheme>" annotation.
  # - action: replace
  #   regex: (https?)
  #   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
  #   target_label: __scheme__
  # Example relabel to customize metric path based on endpoints
  # "prometheus.io/path = <metric path>" annotation.
  # - action: replace
  #   regex: (.+)
  #   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
  #   target_label: __metrics_path__
  # Example relabel to scrape only single, desired port for the service based
  # on endpoints "prometheus.io/scrape_port = <port>" annotation.
  # - action: replace
  #   regex: ([^:]+)(?::\d+)?;(\d+)
  #   replacement: $$1:$$2
  #   source_labels: [__address__,__meta_kubernetes_service_annotation_prometheus_io_port]
  #   target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels: [__meta_kubernetes_namespace]
    target_label: Namespace
  - action: replace
    source_labels: [__meta_kubernetes_service_name]
    target_label: Service
  - action: replace
    source_labels: [__meta_kubernetes_pod_node_name]
    target_label: kubernetes_node
  - action: replace
    source_labels: [__meta_kubernetes_pod_name]
    target_label: pod_name
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: container_name

  # Exclude high cardinality metrics
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_gc_duration_seconds.*'
    action: drop
```
Pods
A pod is a group of one or more containers, with shared storage/network resources, and a specification for how to run the containers. When monitoring pods, we want to watch the pod deployment patterns, total pod instances, and expected vs. actual pod instances.
```yaml
- job_name: 'kubernetes-pods'
  sample_limit: 10000
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Example relabel to scrape only endpoints that have
  # "prometheus.io/scrape = true" annotation.
  # - action: keep
  #   regex: true
  #   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
  # Example relabel to configure scrape scheme for all service scrape targets
  # based on endpoints "prometheus.io/scrape_scheme = <scheme>" annotation.
  # - action: replace
  #   regex: (https?)
  #   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
  #   target_label: __scheme__
  # Example relabel to customize metric path based on endpoints
  # "prometheus.io/path = <metric path>" annotation.
  # - action: replace
  #   regex: (.+)
  #   source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
  #   target_label: __metrics_path__
  # Example relabel to scrape only single, desired port for the service based
  # on endpoints "prometheus.io/scrape_port = <port>" annotation.
  # - action: labelmap
  #   regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels: [__meta_kubernetes_namespace]
    target_label: Namespace
  - action: replace
    source_labels: [__meta_kubernetes_pod_name]
    target_label: pod_name
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: container_name
  - action: replace
    source_labels: [__meta_kubernetes_pod_controller_name]
    target_label: pod_controller_name
  - action: replace
    source_labels: [__meta_kubernetes_pod_controller_kind]
    target_label: pod_controller_kind
  - action: replace
    source_labels: [__meta_kubernetes_pod_phase]
    target_label: pod_phase

  metric_relabel_configs:
  - action: drop
    source_labels: [__name__]
    regex: 'go_gc_duration_seconds.*'
```
Kubernetes (k8s) API Server
The kube-apiserver provides REST operations and the front-end to the cluster’s shared state through which all other components interact. Key metrics to watch for include: the number and duration of requests for each combination of resource (including pods, Deployments, etc.) as well as the operation (such as GET, LIST, POST, DELETE).
The TLS configuration below uses the in-cluster service account credentials (CA certificate and bearer token) to authenticate the scrape requests to the API server.
```yaml
- job_name: 'kubernetes-apiservers'
  sample_limit: 10000
  # Default to scraping over https. If required, just disable this or change to
  # `http`.
  scheme: https

  kubernetes_sd_configs:
  - role: endpoints
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  relabel_configs:
  - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: kubernetes;https
```
Some types of metrics we can receive from the API server include:
Metric | Description | Metric Type |
---|---|---|
apiserver_request_duration_seconds | Duration of requests to the API server for a specific resource and verb | Work: Performance |
workqueue_queue_duration_seconds | Total number of seconds that items spent waiting in a specific work queue | Work: Performance |
workqueue_work_duration_seconds | Total number of seconds spent processing items in a specific work queue | Work: Performance |
cAdvisor
cAdvisor is an agent integrated into the kubelet binary that monitors resource usage and analyzes the performance of containers. Key metrics collected by cAdvisor include the CPU, memory, file, and network usage for containers running on a given node.
```yaml
- job_name: 'kubernetes-cadvisor'
  sample_limit: 10000
  # Default to scraping over https. If required, just disable this or change to
  # `http`.
  scheme: https
  metrics_path: /metrics/cadvisor

  kubernetes_sd_configs:
  - role: node
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
```
Types of metrics we can receive from the cAdvisor include:
Metric | Description |
---|---|
container_cpu_load_average_10s | Value of container cpu load average over the last 10 seconds. |
container_cpu_system_seconds_total | Cumulative system cpu time consumed in seconds. |
container_last_seen | Last time a container was seen by the exporter |
container_memory_failcnt | Number of memory usage hits limits |
container_memory_failures_total | Cumulative count of memory allocation failures. |
Nodes
Kubernetes nodes are the virtual or physical machines that run our workloads. Key metrics to watch for nodes mainly report on resource utilization including allocatable memory/CPU and CPU/disk utilization.
```yaml
- job_name: 'kubernetes-nodes'
  sample_limit: 10000
  # Default to scraping over https. If required, just disable this or change to
  # `http`.
  scheme: https

  kubernetes_sd_configs:
  - role: node
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
```
Types of metrics we can receive from the nodes (via the kubelet) include:
Metric | Description |
---|---|
kubelet_cgroup_manager_duration_seconds | Duration in seconds for cgroup manager operations. Broken down by method. |
kubelet_node_config_error | This metric is true (1) if the node is experiencing a configuration-related error, false (0) otherwise. |
kubelet_pleg_relist_duration_seconds | Duration in seconds for relisting pods in PLEG (pod lifecycle event generator). |
kubelet_pod_start_duration_seconds | Duration in seconds for a single pod to go from pending to running. |
kubelet_pod_worker_duration_seconds | Duration in seconds to sync a single pod. Broken down by operation type: create, update, or sync |
kubelet_running_pod_count | Number of pods currently running |
kubelet_runtime_operations_duration_seconds | Duration in seconds of runtime operations. Broken down by operation type. |
kubelet_runtime_operations_errors_total | Cumulative number of runtime operation errors by operation type. This can be a good indicator of low level issues in the node, like problems with container runtime. |
kubelet_runtime_operations_total | Total count of runtime operations of each type. |
storage_operation_duration_seconds_count | Storage operation duration |
storage_operation_errors_total | Storage operation errors |
Services
A service is an abstract way to expose an application running on a set of pods as a network service. Services define a logical set of pods and a policy by which to access them, decoupling consumers from individual pods. The key signal to monitor for a service is its health. We can use the Blackbox Exporter provided by Prometheus to probe endpoints over HTTP, HTTPS, DNS, TCP, and ICMP.
Prerequisite
To use the Blackbox Exporter, we can deploy the configurations here.
Using the name of the Blackbox Exporter service and the exposed port (9115), we can access the probed metrics by replacing the `__address__` label with the exporter's address. If we do not want to probe all services, we can instead specify a list of target addresses to probe in the `static_configs` (more information can be found here).
```yaml
- job_name: 'kubernetes-services'
  sample_limit: 10000
  metrics_path: /probe
  params:
    module: [http_2xx]

  kubernetes_sd_configs:
  - role: service

  relabel_configs:
  # Example relabel to probe only some services that have "prometheus.io/should_be_probed = true" annotation
  # - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_should_be_probed]
  #   action: keep
  #   regex: true
  - source_labels: [__address__]
    target_label: __param_target
  - target_label: __address__
    replacement: blackbox-exporter-service:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    target_label: kubernetes_name
```
Some types of metrics we can receive from the service probing include:
Metric | Description |
---|---|
probe_duration_seconds | Returns how long the probe took to complete in seconds |
probe_http_status_code | Response HTTP status code |
probe_http_uncompressed_body_length | Length of uncompressed response body |
probe_success | Displays whether or not the probe was a success |
Ingresses
A Kubernetes ingress is an API object that manages external access to the services in a cluster. Similar to services, the key signal to monitor is the health of the ingress. We can use the Blackbox Exporter provided by Prometheus to probe endpoints over HTTP, HTTPS, DNS, TCP, and ICMP.
As set up in Services above, this also requires the Blackbox Exporter.
```yaml
- job_name: 'kubernetes-ingresses'
  sample_limit: 10000
  metrics_path: /probe
  params:
    module: [http_2xx]

  kubernetes_sd_configs:
  - role: ingress

  relabel_configs:
  # Example relabel to probe only some services that have "prometheus.io/should_be_probed = true" annotation
  # - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_should_be_probed]
  #   action: keep
  #   regex: true
  - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
    regex: (.+);(.+);(.+)
    replacement: $${1}://$${2}$${3}
    target_label: __param_target
  - target_label: __address__
    replacement: blackbox-exporter-service:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_ingress_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_ingress_name]
    target_label: kubernetes_name
```
Some types of metrics that we can receive from the service probing include:
Metric | Description |
---|---|
probe_duration_seconds | Returns how long the probe took to complete in seconds |
probe_http_status_code | Response HTTP status code |
probe_http_uncompressed_body_length | Length of uncompressed response body |
probe_success | Displays whether or not the probe was a success |
Notice that these metrics are similar to the service metrics above (as they are both probing metrics). The difference will lie in the labels.
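To make the probing relabel rules above more concrete, here is a simplified sketch of what a Prometheus `replace` relabel action does: join the source label values with a separator, match an anchored regex, and expand capture groups in the replacement. This is a minimal illustration, not Prometheus's actual implementation; note that the `$$` in the collector config is just `$` escaped against environment-variable expansion.

```python
import re


def relabel(labels, source_labels, regex, replacement, separator=";"):
    # Prometheus joins the source label values with the separator, then
    # applies the regex anchored to the full string; $1/${1}/... in the
    # replacement refer to capture groups.
    joined = separator.join(labels.get(l, "") for l in source_labels)
    m = re.fullmatch(regex, joined)
    if m is None:
        return None  # on no match, a 'replace' action leaves the target untouched
    out = replacement
    for i, group in enumerate(m.groups(), start=1):
        out = out.replace("${%d}" % i, group).replace("$%d" % i, group)
    return out


# The ingress rule above builds the probe target as scheme://address+path:
labels = {
    "__meta_kubernetes_ingress_scheme": "https",
    "__address__": "my-app.example.com",
    "__meta_kubernetes_ingress_path": "/",
}
target = relabel(
    labels,
    ["__meta_kubernetes_ingress_scheme", "__address__", "__meta_kubernetes_ingress_path"],
    r"(.+);(.+);(.+)",
    "${1}://${2}${3}",
)
print(target)  # -> https://my-app.example.com/
```

The service job needs no regex at all because `__address__` alone is already the probe target; the ingress job needs one only to stitch the scheme, host, and path back together.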
Permissions
Kubernetes API Server
If you are scraping for kube-system components and the API server, the endpoints need to be enabled for private access. More information on this can be found here.
RBAC for Other Kubernetes Resources
In order for service discovery and scraping to work, the ADOT Collector pod may need permissions to get and list objects of the EKS cluster. By default, the OTel Collector uses the default service account to communicate with the API server. We can set a ClusterRoleBinding for this service account such that it can access and scrape the necessary metric endpoints.
Example configuration:
```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otelcol-rbac
subjects:
- kind: ServiceAccount
  name: default
  namespace: otelcol
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
```
If necessary, we can restrict the service account access to the specific Kubernetes resources we want to scrape. For instance, if we were specifically scraping for pod-level metrics, we could use the following ClusterRole and ClusterRoleBinding:
```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prom-admin
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prom-rbac
subjects:
- kind: ServiceAccount
  name: default
  namespace: otelcol
roleRef:
  kind: ClusterRole
  name: prom-admin
  apiGroup: rbac.authorization.k8s.io
---
```
To scrape node or cAdvisor metrics, we need to give the service account access to `nodes/metrics`. As an example, the following ClusterRole and ClusterRoleBinding should work for pod, node, service, and cAdvisor metrics.
```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prom-admin
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prom-rbac
subjects:
- kind: ServiceAccount
  name: default
  namespace: otelcol
roleRef:
  kind: ClusterRole
  name: prom-admin
  apiGroup: rbac.authorization.k8s.io
---
```
Prometheus Remote Write Exporter Configurations
For the Prometheus Remote Write Exporter to sign your HTTP requests with AWS SigV4 (AWS' authentication protocol for secure authentication), you will need to provide the `auth` configuration with the `sigv4auth` authenticator. If `auth` is not provided, HTTPS requests will not be signed.
```yaml
extensions:
  sigv4auth:
    service: "aps"
    region: "user-region"
exporters:
  prometheusremotewrite:
    endpoint: "https://aws-managed-prometheus-endpoint/v1/api/remote_write"
    auth:
      authenticator: sigv4auth
```
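Under the hood, the `sigv4auth` extension relies on the AWS SDK to sign each request. To give a feel for why `service` and `region` must be configured, here is a stdlib-only sketch of the SigV4 signing-key derivation defined by AWS: the long-term secret never signs a request directly; instead a key is derived per date, region, and service by chained HMAC-SHA256. The secret and date below are made-up placeholders.

```python
import hashlib
import hmac


def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    # SigV4 chains four HMAC-SHA256 operations so the derived key is
    # scoped to one date, one region, and one service.
    def sign(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode(), hashlib.sha256).digest()

    k_date = sign(("AWS4" + secret_key).encode(), date)
    k_region = sign(k_date, region)
    k_service = sign(k_region, service)
    return sign(k_service, "aws4_request")


# "aps" is the service name AMP uses for signing, matching the extension config.
key = sigv4_signing_key("EXAMPLE-SECRET", "20240101", "us-east-1", "aps")
print(key.hex())
```

This is why a mismatched `region` or `service` value in the `sigv4auth` block produces authentication failures: the derived key no longer matches what AMP expects.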
Aside from the auth configurations, the Prometheus Remote Write Exporter is also configurable with retry, sending queue, and timeout settings. An example of these configurations is provided below.
```yaml
extensions:
  sigv4auth:
    service: "aps"
    region: "us-east-1"
exporters:
  prometheusremotewrite:
    endpoint: "https://aps-workspaces-gamma.us-east-1.amazonaws.com/workspaces/ws-7cd45747-2381-4a2a-847f-fa61a3694a74/api/v1/remote_write"
    namespace: test
    auth:
      authenticator: sigv4auth
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 10s
      max_elapsed_time: 30s
    timeout: 15s
```
More information on the possible retry, sending queue, and timeout configurations can be found here.
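The interaction between the three retry settings can be sketched as follows. This is a simplified, deterministic model of exponential backoff, assuming a 1.5x growth factor and no jitter; the collector's actual implementation randomizes each interval, so treat this only as an illustration of how `initial_interval`, `max_interval`, and `max_elapsed_time` bound the retry schedule.

```python
def retry_schedule(initial_interval=5.0, max_interval=10.0,
                   max_elapsed_time=30.0, multiplier=1.5):
    # Each wait grows by the multiplier, is capped at max_interval, and
    # retries stop once total elapsed time would exceed max_elapsed_time.
    waits, elapsed, interval = [], 0.0, initial_interval
    while elapsed + interval <= max_elapsed_time:
        waits.append(interval)
        elapsed += interval
        interval = min(interval * multiplier, max_interval)
    return waits


# With the example config above (5s initial, 10s cap, 30s budget):
print(retry_schedule())  # -> [5.0, 7.5, 10.0]
```

In other words, the example configuration allows roughly three retry attempts before the 30-second budget is exhausted and the batch is dropped (or handed to the sending queue, if enabled).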
Takeaway
These advanced configurations should enable you to monitor your applications and Kubernetes cluster in a secure and reliable manner. If you would like a more basic setup, please take a look at the Getting Started with the AWS Distro for OpenTelemetry Collector-AMP Pipeline in EKS guide.
We would love to hear about more common configuration scenarios or improvements to this documentation from you! Please submit an issue on the aws-otel community to let us know.