Introduction to distributed tracing

What is tracing?

Tracing lets you see how a request progresses through different components, how long each operation takes, and which logs and errors occur along the way.

Why tracing?

A typical web application consists of multiple components written in different languages and running on different platforms:

  • Load balancer (e.g. nginx).
  • Frontend code (e.g. React).
  • Backend monolith or microservices.
  • At least one database.
  • Task/job queue.

Distributed tracing collects data from such diverse environments and allows you to:

  • Monitor the performance of each operation (for example, a SQL query), of individual components (for example, the database), and of the whole request round trip.
  • Monitor logs and errors no matter where they come from.
  • Tie everything together into a single trace.

Spans

A span represents an operation (unit of work) in a trace. A span could be a remote procedure call (RPC), a database query, or potentially interesting code. A span has:

  • A parent span (except for the root span, which has none).
  • An operation name (span name).
  • A span kind.
  • Start and end time (duration = end_time - start_time).
  • A status that reports whether the operation succeeded or failed.
  • A set of key-value attributes describing the operation.
  • A timeline of events.
  • A list of links to other spans.
  • A span context that propagates trace ID and other data between different services.

A trace is a tree of spans that shows the path a request takes through an app. Each span is an operation that the app performs while handling that request.
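The span fields listed above, and the way spans form a tree, can be sketched with a small data model. This is a toy illustration using only the Python standard library, not the OpenTelemetry API: the class name `Span`, its field names, the helpers `children_by_parent` and `render`, and the sample IDs and timings are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Toy span record; field names are illustrative, not the OpenTelemetry API."""
    name: str
    span_id: str
    trace_id: str
    parent_id: Optional[str]  # None marks the root span
    start_time: float         # seconds since the start of the request
    end_time: float
    kind: str = "internal"
    status: str = "unset"
    attributes: dict = field(default_factory=dict)

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time


def children_by_parent(spans):
    """Group spans by parent ID so the trace can be walked as a tree."""
    tree = {}
    for span in spans:
        tree.setdefault(span.parent_id, []).append(span)
    return tree


def render(tree, parent_id=None, depth=0):
    """Return one indented line per span, parents before children."""
    lines = []
    for span in tree.get(parent_id, []):
        lines.append("  " * depth + f"{span.name} ({span.duration:.3f}s)")
        lines.extend(render(tree, span.span_id, depth + 1))
    return lines


# Made-up spans that share one trace ID and point at their parent.
spans = [
    Span("GET /users", "a1", "t1", None, 0.000, 0.120, kind="server"),
    Span("SELECT users", "b2", "t1", "a1", 0.010, 0.090, kind="client"),
    Span("render JSON", "c3", "t1", "a1", 0.095, 0.115),
]
trace_lines = render(children_by_parent(spans))
```

Printing `trace_lines` line by line shows the root span first and its two children indented beneath it, which is roughly what a trace view displays.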

Events

An event is an entity within a span. Events have a start time but no end time (and therefore no duration), but they can carry the same key-value attributes as spans.

Events represent exceptions, errors, logs, and messages (such as in RPC). Note that none of those has an end time or a duration.
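As a rough sketch of the idea, an event can be modeled as a name, a single timestamp, and attributes. The `Event` class and the `exception_event` helper below are hypothetical and exist only for this example; the event name `exception` and the `exception.type` and `exception.message` keys follow the OpenTelemetry semantic conventions for exceptions.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    """Toy event: a name, one timestamp, and attributes (no end time)."""
    name: str
    timestamp: float
    attributes: dict = field(default_factory=dict)


def exception_event(exc: BaseException) -> Event:
    """Turn an exception into an event that a span could keep on its timeline."""
    return Event(
        name="exception",
        timestamp=time.time(),
        attributes={
            "exception.type": type(exc).__name__,
            "exception.message": str(exc),
        },
    )


try:
    int("not a number")
except ValueError as exc:
    event = exception_event(exc)
```

The event records when the error happened and what it was, and nothing about how long it lasted.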

Naming spans and attributes

See Naming spans and semantic attributes.

Span kind

Span kind must have one of the following values:

  • server for server operations.
  • client for client operations.
  • producer for message producers.
  • consumer for message consumers.
  • internal for internal operations.

Status code

The status code reports whether the span's operation succeeded or failed. It must have one of the following values:

  • unset - the default value.
  • ok - success.
  • error - failure.

Trace context

Trace context is request-scoped data such as:

  • trace id - unique trace identifier;
  • span id - unique span identifier;
  • trace flags - various flags such as sampled, deferred, and debug.

OpenTelemetry propagates context between functions within a process (in-process propagation) and from one service to the next (distributed propagation). Distributed tracing uses context for span correlation, for example, to assemble spans from multiple services into a single trace.
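For distributed propagation over HTTP, one widely used wire format is the W3C Trace Context `traceparent` header, which packs a version, the trace id, the span id, and the trace flags into one string. The sketch below formats and parses such a header with the standard library only. The function names are made up for this example, and the parser is deliberately strict: it accepts only the four-field layout and does not handle the forward-compatibility rules for future versions.

```python
import re

# traceparent layout: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)
SAMPLED_FLAG = 0x01


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build a version 00 traceparent value from hex-encoded IDs."""
    flags = SAMPLED_FLAG if sampled else 0
    return f"00-{trace_id}-{span_id}-{flags:02x}"


def parse_traceparent(header: str):
    """Return (trace_id, span_id, sampled), or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, bool(int(flags, 16) & SAMPLED_FLAG)


header = format_traceparent(
    "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", sampled=True
)
parsed = parse_traceparent(header)
```

A downstream service that receives this header can start its own spans with the same trace id and use the received span id as the parent, which is what ties spans from several services into one trace.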

Sampling

Sampling controls whether OpenTelemetry records and exports a trace. It reduces the noise and the cost of tracing. Sampling ensures that the whole (potentially distributed) trace is either sampled or dropped, which it achieves by using context propagation and/or the TraceIdRatioBased sampler.

OpenTelemetry has two span properties responsible for sampling:

  • IsRecording - when false, the span discards attributes, events, links, etc.
  • Sampled - when false, OpenTelemetry drops the span.

You should check the IsRecording property before collecting expensive trace data.
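The check might look like the following sketch. `ToySpan` is a stand-in written for this example and models only the IsRecording switch; `expensive_summary` and `handle` are hypothetical application functions. In the OpenTelemetry Python API, the corresponding check is the span's `is_recording()` method.

```python
class ToySpan:
    """Stand-in for a span; only models the IsRecording switch."""

    def __init__(self, recording: bool):
        self._recording = recording
        self.attributes = {}

    def is_recording(self) -> bool:
        return self._recording

    def set_attribute(self, key, value):
        if self._recording:  # non-recording spans discard attributes
            self.attributes[key] = value


calls = []  # remembers how often the expensive function actually ran


def expensive_summary(rows):
    """Hypothetical costly computation that is only useful on recorded spans."""
    calls.append(len(rows))
    return ",".join(sorted(str(r) for r in rows))


def handle(span, rows):
    if span.is_recording():  # skip the work when the result would be discarded
        span.set_attribute("rows.summary", expensive_summary(rows))


recorded, dropped = ToySpan(True), ToySpan(False)
handle(recorded, [3, 1, 2])
handle(dropped, [3, 1, 2])
```

Without the guard, the non-recording span would still discard the attribute, but only after the application had already paid for computing it.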

A sampler is a function that accepts a root span that is about to be created. It returns a sampling decision, which must be one of:

  • Drop - the trace is dropped. IsRecording = false, Sampled = false.
  • RecordOnly - the trace is recorded but not sampled. IsRecording = true, Sampled = false.
  • RecordAndSample - the trace is recorded and sampled. IsRecording = true, Sampled = true.
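The three decisions and a ratio-based sampler can be sketched as follows. This is a simplified illustration of the idea behind a sampler such as TraceIdRatioBased, not its exact algorithm: the `Decision` enum, the `ratio_sampler` factory, and the choice to compare the low 64 bits of the trace ID against a bound are assumptions made for the example.

```python
from enum import Enum


class Decision(Enum):
    # Each value is the pair (is_recording, sampled).
    DROP = (False, False)
    RECORD_ONLY = (True, False)
    RECORD_AND_SAMPLE = (True, True)

    @property
    def is_recording(self) -> bool:
        return self.value[0]

    @property
    def sampled(self) -> bool:
        return self.value[1]


def ratio_sampler(ratio: float):
    """Build a sampler that keeps roughly `ratio` of traces, keyed on the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = int(ratio * (1 << 64))

    def should_sample(trace_id: int) -> Decision:
        # Use only the low 64 bits so the comparison range is fixed.
        if (trace_id & ((1 << 64) - 1)) < bound:
            return Decision.RECORD_AND_SAMPLE
        return Decision.DROP

    return should_sample


sampler = ratio_sampler(0.25)
kept = sampler(0x0000000000000001)
dropped = sampler(0xFFFFFFFFFFFFFFFF)
```

Because the decision depends only on the trace ID, every service that sees the same trace ID reaches the same decision, so a trace is kept or dropped as a whole.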

By default, Uptrace samples all traces, but you can configure it to sample only a fraction of them. In that case, Uptrace uses information from the sampler to adjust the number of spans according to the sampling probability.
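One common way to make such an adjustment is to divide the number of sampled spans by the sampling probability. The sketch below shows only that arithmetic; the function name `estimate_total` is made up for this example, and how Uptrace performs the adjustment internally is not described here.

```python
def estimate_total(sampled_count: int, probability: float) -> float:
    """Scale a count of sampled spans up to an estimate of the true total."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return sampled_count / probability


# 50 spans seen while keeping 25% of traces suggests about 200 spans overall.
estimated = estimate_total(50, 0.25)
```

With a probability of 1 (the default of sampling everything), the estimate equals the observed count.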

What is OpenTelemetry?

OpenTelemetry is a vendor-neutral API for distributed traces and metrics. It defines a specification that standardizes how to collect and send telemetry data to backend platforms. This means that you can instrument your application once and then add or change vendors (tracing backends) as required. OpenTelemetry is available for most programming languages and provides tracing interoperability across different languages and environments.

Uptrace uses OpenTelemetry for collecting traces, logs, and errors. The process is outlined below:

  • The OpenTelemetry API instruments your application with spans and metrics.
  • The OpenTelemetry SDK exports the collected data to Uptrace.
  • Uptrace uses that information to help you profile, monitor, and debug your application.

OpenTelemetry overhead

To monitor your app or library, OpenTelemetry instrumentation inevitably introduces some performance overhead and additional code dependencies. However, the overhead is relatively small, and a lot of effort has gone into minimizing any negative impact on the end-user application. For example, the OpenTelemetry exporter API is asynchronous and does not block the user application.

When disabled, the overhead of OpenTelemetry instrumentation is a few function calls with minimal memory allocations. When enabled, the overhead is specific to the instrumented library, and you can sample only the subset of spans that is of interest to you.

To summarize:

  • When disabled, there is practically no overhead. Without an implementation, OpenTelemetry does not produce telemetry.
  • When enabled, the overhead is small and under your control. Whenever possible, the API is asynchronous and does not block the end-user application.

Tracing vs execution hooks

To monitor a complex system, we need dozens of execution hooks that provide:

  • Connection details and time required to establish a connection.
  • A request and time required to build and write the request.
  • A response and time required to read and build the response.
  • Information about retries and errors.
  • Any other system-specific information.

OpenTelemetry (or tracing in general) replaces dozens of different hooks with a concise API created specifically for monitoring. Such an API has similar performance overhead but is more flexible and easier to maintain.

OpenTelemetry API overview

OpenTelemetry provides the following APIs:

  • TracerProvider registers span processors and creates tracers.
  • Tracer creates spans.
  • Span is an operation within a trace.
  • SpanContext carries context information, such as the trace id, in a format like W3C Trace Context.
  • SpanProcessor batches spans together.
  • SpanExporter exports collected spans to Uptrace (or another backend).

Each programming language has a slightly different API, but usually you have to:

  • Set the default tracer provider.
  • Register the Uptrace span processor using the provider.
  • Create a named tracer for your application or library.
  • Start and end spans using the tracer.
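To show how these pieces fit together, here is a toy pipeline written with the Python standard library only. It is not the OpenTelemetry SDK: the classes below are simplified stand-ins whose names mirror the concepts listed above, the in-memory exporter takes the place of a real backend such as Uptrace, and the tracer name, span names, and batch size are made up for the example.

```python
import time
from contextlib import contextmanager


class InMemoryExporter:
    """Toy exporter: keeps finished spans in a list instead of sending them anywhere."""

    def __init__(self):
        self.exported = []

    def export(self, spans):
        self.exported.extend(spans)


class BatchProcessor:
    """Toy processor: buffers finished spans and exports them in batches."""

    def __init__(self, exporter, batch_size=2):
        self.exporter, self.batch_size, self.buffer = exporter, batch_size, []

    def on_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.exporter.export(self.buffer)
            self.buffer = []


class Tracer:
    """Toy tracer: creates spans and reports them to the processors when they end."""

    def __init__(self, name, processors):
        self.name, self.processors = name, processors

    @contextmanager
    def start_span(self, span_name):
        span = {"tracer": self.name, "name": span_name, "start": time.time()}
        try:
            yield span
        finally:
            span["end"] = time.time()
            for processor in self.processors:
                processor.on_end(span)


class TracerProvider:
    """Toy provider: holds the registered processors and hands out tracers."""

    def __init__(self):
        self.processors = []

    def add_span_processor(self, processor):
        self.processors.append(processor)

    def get_tracer(self, name):
        return Tracer(name, self.processors)


provider = TracerProvider()                             # 1. set up a provider
exporter = InMemoryExporter()
processor = BatchProcessor(exporter)
provider.add_span_processor(processor)                  # 2. register a span processor
tracer = provider.get_tracer("example-app")             # 3. create a named tracer
for op in ("load-user", "render-page", "send-email"):   # 4. start and end spans
    with tracer.start_span(op):
        pass
processor.flush()  # export whatever is still buffered
```

The first two spans are exported together once the batch fills up, and the final flush exports the third. A real SDK follows the same provider, processor, exporter flow, but exports asynchronously in the background.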