Tier 3 logging in code: OpenTelemetry to Grafana Tempo

I ended Tier 2 with correlationId and traceparent strings threaded through the EventBridge detail by hand. Here I swap the strings for real spans, in three changes against Tier 2:

  1. An OpenTelemetry SDK init module. lib/otel.ts configures a NodeTracerProvider with an OTLP HTTP exporter and registers auto-instrumentation for the AWS SDK and http.
  2. propagation.inject and propagation.extract replace manual traceparent string handling. The API Lambda’s span becomes the parent; the EventBridge subscriber continues it.
  3. A local Grafana Tempo + Grafana docker-compose. Traces ship to host.docker.internal:4318; you read them in Grafana at localhost:3000. The same code targets Grafana Cloud by swapping three env vars.

Repo: github.com/danieljohnmorris/aws-api-logging-tiers. The tier-3/ folder is the diff against tier-2/:

git diff tier-2..tier-3 -- handlers/ lib/ cdk/

OTel SDK init

lib/otel.ts runs at module load. The two handler files import it FIRST, before any AWS SDK or zod imports, so auto-instrumentation can wrap them:

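// tier-3/lib/otel.ts
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { AwsInstrumentation } from '@opentelemetry/instrumentation-aws-sdk';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

// {} for local Tempo; a Basic Authorization header when Grafana Cloud credentials are set.
const { OTLP_USER, OTLP_TOKEN } = process.env;
const headers: Record<string, string> = OTLP_USER && OTLP_TOKEN
  ? { Authorization: `Basic ${Buffer.from(`${OTLP_USER}:${OTLP_TOKEN}`).toString('base64')}` }
  : {};

const provider = new NodeTracerProvider({ resource: new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]: process.env.OTEL_SERVICE_NAME!,
})});

// Batch spans and export them over OTLP/HTTP.
provider.addSpanProcessor(new BatchSpanProcessor(
  new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT, headers }),
  { maxQueueSize: 100, scheduledDelayMillis: 500 },
));

// Install this provider as the global tracer provider.
provider.register();

registerInstrumentations({
  instrumentations: [new HttpInstrumentation(), new AwsInstrumentation()],
});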

Two things worth flagging:

  • scheduledDelayMillis: 500. Lambda invocations are short-lived; the default batch delay (5 s) means most traces never get flushed before the runtime freezes the process. 500 ms is long enough to batch spans within an invocation but short enough to ship them before the container goes idle.
  • headers. For local Tempo this is {}. For Grafana Cloud it’s { Authorization: 'Basic <base64(instanceId:token)>' }, derived from two env vars at module load.
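One caveat on the first bullet: a shorter delay narrows the window but does not close it, because Lambda can freeze the process as soon as the handler returns, and a batch timer that has not fired yet then waits for the next invocation. A common guard is to flush explicitly before returning. A minimal sketch, assuming lib/otel.ts exports its provider (the listing above does not) and using a hypothetical withFlush wrapper that is not in the repo. It relies only on forceFlush(), which the SDK's tracer provider exposes:

```typescript
// Anything with a forceFlush() method, e.g. the NodeTracerProvider from lib/otel.ts.
interface Flushable {
  forceFlush(): Promise<void>;
}

// Hypothetical helper (not in the repo): wraps a handler so pending spans are
// exported before the result is returned and the runtime can freeze the process.
function withFlush<E, R>(
  provider: Flushable,
  handler: (event: E) => Promise<R>,
): (event: E) => Promise<R> {
  return async (event: E): Promise<R> => {
    try {
      return await handler(event);
    } finally {
      try {
        await provider.forceFlush();
      } catch (err) {
        // A telemetry failure should not fail the request; log it and move on.
        console.warn('span flush failed', err);
      }
    }
  };
}
```

Usage would be along the lines of export const handler = withFlush(provider, async (event) => { /* ... */ }). The trade-off is that each invocation pays for the export round trip before it returns.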

Span context replaces manual traceparent

Tier 2’s handler read event.headers['traceparent'] and stuffed the string into the EventBridge detail. Tier 3 uses the OTel API directly:

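// tier-3/handlers/api.ts
import '../lib/otel'; // first, so auto-instrumentation can wrap the AWS SDK
import type { APIGatewayProxyHandlerV2 } from 'aws-lambda';
import { context, propagation, trace, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('tier3-api');

export const handler: APIGatewayProxyHandlerV2 = async (event) => {
  // Continue the caller's trace if the request carries a traceparent header.
  const parentCtx = propagation.extract(context.active(), event.headers ?? {});

  return tracer.startActiveSpan(
    'POST /orders',
    { kind: SpanKind.SERVER },
    parentCtx,
    async (span) => {
      try {
        // orderId, customerId, sku, amount, correlationId and the ddb / events
        // clients come from request handling and setup elided here.
        span.setAttributes({ order_id: orderId, customer_id: customerId, sku });

        await ddb.send(new PutItemCommand({ /* ... */ }));

        // Serialize the active span context into a plain object.
        const carrier: Record<string, string> = {};
        propagation.inject(context.active(), carrier);

        await events.send(new PutEventsCommand({
          Entries: [{ /* ... */ Detail: JSON.stringify({
            orderId, customerId, sku, amount, correlationId,
            traceparent: carrier.traceparent,
            tracestate: carrier.tracestate,
          }) }],
        }));

        // ... build and return the API response (elided)
      } finally {
        span.end(); // end the span even if a call above throws
      }
    },
  );
};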

The shape is identical to Tier 2 from the EventBridge bus’s perspective: a traceparent string on the detail. The difference is upstream of the bus. The string now comes out of propagation.inject, and the API Lambda’s span is a real OTel span the exporter ships to Tempo.

The subscriber does the inverse:

// tier-3/handlers/process.ts
// detail is the EventBridge event's detail payload, as published by the API Lambda.
const parentCtx = propagation.extract(context.active(), {
  traceparent: detail.traceparent,
  tracestate: detail.tracestate,
});

return tracer.startActiveSpan(
  'process OrderCreated',
  { kind: SpanKind.CONSUMER },
  parentCtx,
  async (span) => { /* ... */ },
);

Tempo now shows the two spans on the same trace, with the consumer rooted under the API span.
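The link between the two spans is nothing more than the traceparent string. It follows the W3C Trace Context format: a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte, joined by hyphens. Below is a small sketch of a parser for that format, which can help when checking in a test or a log line that both Lambdas carry the same trace ID. parseTraceparent is a hypothetical helper, not part of the repo or the OTel API, and it accepts version 00 only:

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex chars, shared by every span in the trace
  parentId: string; // 16 lowercase hex chars, the span that injected the header
  sampled: boolean; // lowest bit of the flags byte
}

// version "00", then trace ID, parent span ID and flags, separated by hyphens.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns null for anything that is not a valid
// version-00 traceparent value.
function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const traceId = match[1];
  const parentId = match[2];
  const flags = match[3];
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version: '00',
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 1) === 1,
  };
}
```

Parsing detail.traceparent in the subscriber and comparing its traceId with the one the trigger script prints is a quick way to confirm propagation before opening Grafana.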

Tempo + Grafana via docker-compose

docker-compose.tempo.yml runs the two services and provisions the Grafana data source so there’s no manual setup:

services:
  tempo:
    image: grafana/tempo:2.7.0
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tier-3/grafana/tempo.yaml:/etc/tempo.yaml
    ports: ["3200:3200", "4318:4318", "4317:4317"]

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_DISABLE_LOGIN_FORM=true
    volumes:
      - ./tier-3/grafana/provisioning:/etc/grafana/provisioning
    ports: ["3000:3000"]

The Lambda inside LocalStack reaches Tempo over host.docker.internal:4318 (set as the default OTEL_EXPORTER_OTLP_ENDPOINT in the CDK stack).
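One detail that is easy to trip over here: lib/otel.ts hands the endpoint to the exporter as url, and to my understanding the OTLP HTTP exporter uses an explicit url verbatim instead of appending the traces path, which it only does when it reads OTEL_EXPORTER_OTLP_ENDPOINT from the environment itself. Under that assumption the configured value needs the full /v1/traces path. A minimal sketch of a guard, where toTracesUrl is a hypothetical helper that is not in the repo:

```typescript
// Hypothetical helper: make sure an OTLP/HTTP endpoint ends in the traces path.
// Assumes the exporter uses an explicit `url` option as-is.
function toTracesUrl(endpoint: string): string {
  // Drop trailing slashes so the suffix check and the join behave the same.
  const trimmed = endpoint.replace(/\/+$/, '');
  return trimmed.endsWith('/v1/traces') ? trimmed : `${trimmed}/v1/traces`;
}
```

With this, a bare http://host.docker.internal:4318 becomes http://host.docker.internal:4318/v1/traces, while a URL that already ends in /v1/traces passes through unchanged.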

What a trace looks like

After deploying the stack and posting five orders, Tempo’s search panel lists each invocation:

[Screenshot: Tempo search]

Opening one trace shows the full call graph, 24 ms end to end: tier3-api POST /orders as the root, DynamoDB.PutItem and EventBridge.PutEvents as auto-instrumented child spans, then the tier3-process process OrderCreated consumer span (a child of the API span via the threaded traceparent) with its own DynamoDB.UpdateItem child:

[Screenshot: one trace, flame view]

A trace flame graph shows per-step latency and per-step errors on one screen, which Logs Insights cannot do.

How to run it

git clone https://github.com/danieljohnmorris/aws-api-logging-tiers
cd aws-api-logging-tiers
docker compose -f docker-compose.yml -f docker-compose.tempo.yml up -d
pnpm install
pnpm --filter @aws-api-logging-tiers/tier-3 run deploy
pnpm --filter @aws-api-logging-tiers/tier-3 run trigger
# trigger output prints a traceId. Open http://localhost:3000 (Grafana),
# select Tempo, paste the traceId into the TraceQL query box, hit Run.

Going to Grafana Cloud

The same code ships to Grafana Cloud with three env vars in .env:

OTLP_ENDPOINT=https://otlp-gateway-prod-eu-west-2.grafana.net/otlp/v1/traces
OTLP_USER=<instance-id-from-grafana-cloud>
OTLP_TOKEN=<access-policy-token-with-traces-write>

lib/otel.ts reads them at module load, base64-encodes ${OTLP_USER}:${OTLP_TOKEN} into a Basic Authorization header, and ships to the public endpoint. The handler code does not change. The local Docker stack stays useful for fast iteration; you only point at Grafana Cloud when you want the trace to outlive your laptop.

What did not change

The hand-rolled correlationId from Tier 1 is still in every log line and on every EventBridge payload. Powertools is still the logger; this repo’s tier-3/ drops the EMF metric calls to keep the diff focused on tracing, but a real Tier 3 stack would keep Tier 2’s EMF metrics alongside the OTel traces. The DynamoDB table, the EventBridge bus, the API Gateway, and the two Lambda function shapes are all identical to Tier 1.

Each tier adds rather than replaces. Tier 1’s correlationId still travels alongside Tier 3’s traceparent. Tier 2’s Powertools logger keeps doing structured logs while OTel handles the trace.