
Debug a prod incident across services

The problem

A user filed a bug saying "checkout broke around 14:32." You don't know which service to blame. You have OTLP logs flowing into kyma from every service. Three KQL questions get you from "something happened" to a specific request that exposes the failure.

The schema

This recipe assumes the auto-created otel_logs table populated by your services' OTLP exporters via the OTLP frontend. The columns you'll use:

Column          Type        Source
_timestamp      timestamp   OTLP time_unix_nano.
severity_text   string      OTLP severity_text (e.g., ERROR).
service_name    string      OTLP resource attribute.
body            string      OTLP log body.
trace_id        string      OTLP trace_id, present on logs from a traced request.

If you're also shipping non-OTLP logs, ensure service_name and trace_id exist as typed columns; otherwise promote them from dynamic before running the recipe.
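
If promoting isn't an option yet, you can extract the fields at query time. The sketch below is illustrative only: the app_logs table name, the dynamic attributes column, its key names, and the tostring() conversion are all assumptions about how your non-OTLP logs land, so substitute whatever your ingest actually produces.

kql
app_logs
| where _timestamp > ago(1h)
| extend service_name = tostring(attributes["service.name"])
| extend trace_id = tostring(attributes["trace_id"])
| summarize n = count() by service_name
| order by n desc

Query-time extraction is fine for a one-off investigation; typed columns are the better long-term fix, since every query in this recipe filters on them.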

The queries

1. Find the noise

Bound by time, filter to errors, count by service.

kql
otel_logs
| where _timestamp > datetime("2026-05-03T14:30:00Z")
| where _timestamp < datetime("2026-05-03T14:35:00Z")
| where severity_text == "ERROR"
| summarize n = count() by service_name
| order by n desc

The service with the most errors is your suspect — usually obvious from the count.

2. Read the suspect's bodies

Same window, filter to the suspect, project the bodies.

kql
otel_logs
| where _timestamp > datetime("2026-05-03T14:30:00Z")
| where _timestamp < datetime("2026-05-03T14:35:00Z")
| where service_name == "checkout-svc"
| where severity_text == "ERROR"
| project _timestamp, trace_id, body
| take 20

Skim the bodies; pick a representative trace_id.

3. Follow the trace across services

Drop the service filter; bound by trace_id.

kql
otel_logs
| where trace_id == "abc1234..."
| project _timestamp, service_name, severity_text, body
| order by _timestamp asc

You see the request enter web-svc, hop to checkout-svc, and crash in the payments-svc call. Total time from "got the bug" to "specific failed call": about a minute.

What you should see

Query 1 returns one row per service in the window. Realistic shape:

service_name    n
checkout-svc    247
payments-svc    89
web-svc         14

Query 3, ordered by time, walks the request through your stack:

_timestamp                  service_name   severity_text   body (truncated)
2026-05-03T14:32:08.124Z    web-svc        INFO            POST /api/checkout received
2026-05-03T14:32:08.131Z    checkout-svc   INFO            reserving inventory for cart_id=42
2026-05-03T14:32:08.219Z    checkout-svc   INFO            charging via payments-svc
2026-05-03T14:32:11.487Z    payments-svc   ERROR           upstream timeout: stripe-api 30000ms
2026-05-03T14:32:11.488Z    checkout-svc   ERROR           charge failed; rolling back reservation
2026-05-03T14:32:11.489Z    web-svc        ERROR           502 Bad Gateway

Variations

  • Error rate over time: in query 1, replace summarize n = count() by service_name with extend bucket = bin(_timestamp, 30s) | summarize n = count() by service_name, bucket (written out in full below).
  • Last hour, no specific window: replace the two _timestamp filters in query 1 with where _timestamp > ago(1h).
  • Compare to baseline: run the query-1 summarize twice, once for the suspect window and once for the same five minutes the day before, and compare the per-service counts (sketch below).
  • Different signal: swap otel_logs for traces if your spans table is populated. The shape is the same; the equivalent of severity_text == "ERROR" is status == "ERROR" with most exporters.
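
Written out in full, the error-rate-over-time variation is query 1 with a 30-second bucket added; only the summarize step changes:

kql
otel_logs
| where _timestamp > datetime("2026-05-03T14:30:00Z")
| where _timestamp < datetime("2026-05-03T14:35:00Z")
| where severity_text == "ERROR"
| extend bucket = bin(_timestamp, 30s)
| summarize n = count() by service_name, bucket
| order by bucket asc

A count that jumps in one bucket and stays high usually means a hard failure; a gradual ramp suggests saturation or retries.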
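
The baseline comparison needs no new operators: re-run query 1 with the window shifted back 24 hours and compare the per-service counts against the first result. A sketch, assuming the same schema as above:

kql
otel_logs
| where _timestamp > datetime("2026-05-02T14:30:00Z")
| where _timestamp < datetime("2026-05-02T14:35:00Z")
| where severity_text == "ERROR"
| summarize n = count() by service_name
| order by n desc

A checkout-svc count far above its day-before number confirms the suspect; a similar count means the errors are chronic background noise rather than this incident.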