AWS Distro for OpenTelemetry

Processors

Processors are used in several stages of an OpenTelemetry collector pipeline to pre-process the data passing through it. In a processor, data can be modified, batched, filtered, or sampled. The ADOT collector supports a selected list of processors.
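
As an illustration, here is a minimal collector configuration sketch (the receiver endpoint and exporter choice are assumptions, not prescriptions) that places a Batch processor between an OTLP receiver and an X-Ray exporter:

receivers:
  otlp:
    protocols:
      grpc:
processors:
  batch: # groups spans into batches before they reach the exporter
exporters:
  awsxray: # in this sketch the region is taken from the environment
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsxray]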

Processors supported by ADOT collector

The ADOT collector supports the following processors:

Notes on Group by Trace and Tail Sampling processors

To achieve the desired results when using the Tail Sampling and Group by Trace processors, do not use a Batch processor before these components in a pipeline. A Batch processor placed before them might separate spans belonging to the same trace, while these components try to group all the spans of a trace together. In the case of the Tail Sampling processor, this grouping allows a sampling decision to affect all spans of a trace, creating a full picture of the trace if it is sampled. A Batch processor immediately after these components does not cause any problems and is recommended to properly pre-process data for subsequent exporters.
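
For example, a minimal pipeline sketch showing the recommended ordering, with the Batch processor placed after the sampling components rather than before them:

service:
  pipelines:
    traces:
      receivers: [otlp]
      # avoid [batch, groupbytrace, tail_sampling]: batching first may split a trace's spans
      processors: [groupbytrace, tail_sampling, batch]
      exporters: [awsxray]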

You also need to make sure that all the spans of a trace are processed by the same collector instance. This is especially important for a collector running in gateway mode.
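
One common way to achieve this is a two-tier deployment in which a first collector layer routes spans by trace ID to a second layer that runs the sampling processors, using the loadbalancing exporter. The following is a sketch only, and the hostname is hypothetical:

exporters:
  loadbalancing:
    routing_key: traceID # all spans of a trace go to the same backing collector
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: sampling-collectors.example.internal # hypothetical headless service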

In addition, you have to tune the wait_duration parameter of the Group by Trace processor and the decision_wait parameter of the Tail Sampling processor to be greater than or equal to the maximum expected latency of a trace in your system, including a grace period for network latency between the application and the collector. This guarantees that the spans of a trace are processed in the same batch.
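
For instance, with hypothetical numbers: if traces in your system complete within 8s and you allow a 2s grace period for network latency, the grouping window would be:

processors:
  groupbytrace:
    wait_duration: 10s # 8s maximum expected trace latency + 2s network grace period (hypothetical values)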

Finally, to limit the number of traces kept in memory, we recommend using the Group by Trace processor before the Tail Sampling processor. The Group by Trace processor implements a limit on the number of traces kept in memory, while the Tail Sampling processor does not fully implement one.

The Group by Trace processor drops the oldest trace when the num_traces limit is exceeded. wait_duration and num_traces should be scaled according to the expected traffic of the monitored applications.

Examples

If the maximum expected latency for a request in your application is 10s and its maximum traffic is 1000 requests per second, wait_duration should be set to 10s and num_traces should be set to at least 10000 (10s * 1000 requests per second). It is highly recommended that you monitor the otelcol_processor_groupbytrace_traces_evicted metric from the collector self telemetry. If the value of this metric is greater than zero, the collector is receiving more traffic than it can handle and you should increase num_traces accordingly.

Example from the description above:

processors:
  groupbytrace:
    wait_duration: 10s
    num_traces: 20000 # Double the max expected traffic (2 * 10 * 1000 requests per second)
  tail_sampling:
    decision_wait: 1s # This value should be smaller than wait_duration
    policies:
      - ..... # Applicable policies
  batch/tracesampling:
    timeout: 0s # No need to wait more since this will happen in previous processors
    send_batch_max_size: 8196 # This will still allow us to limit the size of the batches sent to subsequent exporters
service:
  pipelines:
    traces/tailsampling:
      receivers: [otlp]
      processors: [groupbytrace, tail_sampling, batch/tracesampling]
      exporters: [awsxray]
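
The collector's self telemetry, which includes otelcol_processor_groupbytrace_traces_evicted, is exposed in Prometheus format on port 8888 by default. A minimal sketch of making that explicit follows; note that the exact schema varies across collector versions, and recent releases configure self telemetry through metric readers instead:

service:
  telemetry:
    metrics:
      level: detailed # most verbose self-telemetry level
      address: ":8888" # Prometheus scrape endpoint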

The Tail Sampling processor can combine sampling policies. For example, to sample traces for a specific URL path only when they contain errors, you could use the following configuration:

processors:
  tail_sampling:
    decision_wait: 1s
    policies:
      - name: and-policy
        type: and
        and:
          and_sub_policy:
            - name: path-policy
              type: string_attribute
              string_attribute:
                key: http.url
                values: ["\/users"]
                enabled_regex_matching: true
            - name: error-policy
              type: status_code
              status_code:
                status_codes: ["ERROR", "UNSET"]

The next example samples 20% of the traces that contain an error:

processors:
  tail_sampling:
    decision_wait: 1s
    policies:
      - name: and-policy
        type: and
        and:
          and_sub_policy:
            - name: error-policy
              type: status_code
              status_code:
                status_codes: ["ERROR", "UNSET"]
            - name: probabilistic-policy
              type: probabilistic
              probabilistic:
                sampling_percentage: 20

To see the full set of policy options available for the Tail Sampling processor, please refer to its README.