In the middle of an incident, the bottleneck is not data collection. The bottleneck is asking the right question fast, and trusting the answer. At Qonto, we built an AI-assisted observability workflow where engineers can ask natural-language questions and immediately get verifiable outputs: the system shows the exact SQL it ran on ClickHouse and returns evidence like trace IDs and scoped impact.

This talk shares the core architecture and the “trust patterns” that make it usable under pressure: OpenTelemetry as a common schema, ClickHouse as a single analytical backend for high-cardinality observability data, read-only query tooling, and a UI that keeps humans in the loop. You will leave with a clear blueprint for building a conversational investigation experience without turning incident response into a hallucination lottery.
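To make the read-only constraint concrete, here is a minimal sketch of the kind of guardrail such a query tool can apply before anything reaches ClickHouse. This is an illustrative assumption, not Qonto's implementation: the function name, the keyword list, and the `otel_traces` table name are all hypothetical.

```python
import re

# Hypothetical guardrail for a read-only query tool: only statements that
# start with SELECT or WITH and contain no mutating keywords are allowed.
FORBIDDEN = re.compile(
    r"\b(insert|alter|drop|truncate|delete|update|create|grant|attach|detach)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Return True if the statement looks like a pure read query."""
    stripped = sql.strip().lower()
    if not (stripped.startswith("select") or stripped.startswith("with")):
        return False
    return FORBIDDEN.search(sql) is None

# The tool refuses anything that is not read-only (table name illustrative).
assert is_read_only("SELECT count() FROM otel_traces WHERE StatusCode = 'Error'")
assert not is_read_only("DROP TABLE otel_traces")
```

A keyword filter alone is not a security boundary; in practice it would be paired with a ClickHouse user restricted to read-only grants, so the database enforces the same rule.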


Outline (25 minutes)

  1. The incident problem: speed, cognitive load, and knowledge gaps (2 min)
  2. Design constraints: hallucinations, cost, and large context windows (3 min)
  3. Why ClickHouse + OTel works: one schema, high-cardinality welcome, compression wins (7 min)
  4. The conversational layer: read-only tool, NL → SQL, show the SQL, return trace IDs (5 min)
  5. Mini case study: from “we have 500s” to scoped impact and root cause hints (6 min)
  6. Three takeaways (2 min)
    • Compute in ClickHouse, not in the agent.
    • Always make answers auditable (SQL + evidence).
    • Keep humans in the loop for trust.
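The first two takeaways can be sketched together: the aggregation runs inside ClickHouse, and the answer object always carries the SQL plus raw evidence so an engineer can re-run and verify it. Everything here is an illustrative assumption, not the talk's actual code: the `AuditableAnswer` class is hypothetical, and the `otel_traces` schema and trace ID are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class AuditableAnswer:
    # Every answer carries the exact SQL that produced it plus raw evidence,
    # so a human can audit and re-run it during the incident.
    summary: str
    sql: str                                        # query executed on ClickHouse
    trace_ids: list[str] = field(default_factory=list)  # sample evidence

# Aggregation happens inside ClickHouse; the agent only formats the result.
# Table and column names assume an OTel-style schema (illustrative only).
SQL = """
SELECT ServiceName, count() AS errors, groupArray(5)(TraceId) AS sample_traces
FROM otel_traces
WHERE StatusCode = 'Error' AND Timestamp > now() - INTERVAL 15 MINUTE
GROUP BY ServiceName
ORDER BY errors DESC
LIMIT 5
"""

answer = AuditableAnswer(
    summary="payments-api accounts for most 500s in the last 15 minutes",
    sql=SQL,
    trace_ids=["4bf92f3577b34da6a3ce929d0e0e4736"],  # placeholder trace ID
)
assert answer.sql and answer.trace_ids  # the UI can always show both
```

Keeping the heavy computation in one SQL statement also keeps the agent's context small: it sees five aggregated rows, not millions of spans.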