Observability

Overview

The API ships traces, metrics, and logs to Azure Monitor (Application Insights + Log Analytics) in non-dev environments via UseAzureMonitor() in Program.cs. In Development, OTel is wired up but the Azure exporter is off (set Telemetry:ConsoleMetrics to true to dump metrics to stdout instead).

This page lists the custom signals — the bits beyond what Azure.Monitor.OpenTelemetry.AspNetCore gives you for free — and how to query them.

Custom signals

All defined in FormationTelemetry.cs.

Activity sources

Source	What it spans
`Formation.Handlers`	LiteBus command + event handlers (one span per call)
`Formation.Search`	Full-text search operations
`Formation.Browser`	Browser-emitted spans propagated up via headers

Histograms (meter `Formation.API`)

Metric	Unit	Tags	Source
`formation.handler.duration`	ms	`command`	`InstrumentedCommandMediator`
`formation.event.duration`	ms	`event`	`InstrumentedEventMediator`
`formation.search.duration`	ms	`entity`	`FormationTelemetry.SearchScope`
`formation.efcommand.duration`	ms	`command_type`, `success`	`SlowQueryLoggingInterceptor`

Span events

Event	Where	Attributes
`ef.slow_command`	Current Activity (any)	`db.command_type`, `db.duration_ms`, `db.success`, `db.statement` (4 KB max), optional `db.param.*`

The ef.slow_command event fires whenever an EF Core DbCommand exceeds Telemetry:SlowQueryThresholdMs (default 500). It hangs off whatever Activity is current at the moment EF runs the command — so you’ll see it nested inside the controller / command handler / event handler that triggered it. If no Activity is current (e.g. background work), the slow command is logged via ILogger instead, which still ships to Log Analytics.

Runtime metrics (`OpenTelemetry.Instrumentation.Runtime`)

Wired up via .AddRuntimeInstrumentation() in Program.cs. Publishes the standard .NET runtime counters under dotnet.* — note these were called process.runtime.dotnet.* in earlier versions of the package; the 1.15.x line uses the shorter form that ships in customMetrics. The most useful ones for chasing latency stalls:

Metric	Tags / dimensions	What it tells you
`dotnet.gc.collections`	`gc.heap.generation`	Per-generation collection rate (gen0 / gen1 / gen2). Spikes in gen2 line up with heap-pressure stalls.
`dotnet.gc.last_collection.heap.size`	`gc.heap.generation`	Heap size per generation, sampled. Sudden drop = a Gen 2 collection just happened.
`dotnet.gc.pause.time`	—	Cumulative seconds the runtime spent paused in GC. The sharp number for “is GC the bottleneck”.
`dotnet.gc.heap.total_allocated`	—	Lifetime allocation total (bytes). Rate-of-change under load is the allocation-pressure signal.
`dotnet.thread_pool.thread.count`	—	Worker-thread pool size. Slow climb under load = pool starvation symptoms.
`dotnet.thread_pool.work_item.count`	—	Queue depth in the thread pool. High here under load is the smoking gun for sync-over-async.
`dotnet.monitor.lock_contentions`	—	Contended `lock` / `Monitor` acquisitions. Spikes correlate with serialised hot paths.
`dotnet.jit.compilation.time`	—	Cumulative JIT time. Useful to rule in/out cold-start / tiered-recompilation stalls.

We added these after the detail-page load test surfaced lockstep latency stalls across all six list routes that didn’t show up in any SQL or Kestrel metric — see app.loadtests/BASELINE.md for the investigation. Without runtime metrics the GC hypothesis was unprovable.

Configuration

Both keys live under Telemetry: in appsettings.json. Override per environment via the standard ASP.NET configuration chain.

{
  "Telemetry": {
    "SlowQueryThresholdMs": 500,
    "LogSlowQueryParameters": false
  }
}

SlowQueryThresholdMs — int, ms. Below this threshold the command is recorded in the histogram only (cheap). Above, you also get a span event with the SQL.
LogSlowQueryParameters — bool. Off by default. Flip on locally when chasing a specific slow query and you need parameter values to reproduce. Don’t enable in prod without thinking about parameter-value sensitivity.

Where to find the data

Where to query

The same data is reachable from two surfaces, with different table names:

App Insights resource → Logs blade uses the classic schema: customMetrics, traces, dependencies, customEvents, requests, exceptions. Columns: name, timestamp, customDimensions.
Log Analytics workspace → Logs blade uses the workspace-native schema: AppMetrics, AppTraces, AppDependencies, AppEvents, AppRequests, AppExceptions. Columns: Name, TimeGenerated, Properties.

The queries below use the workspace-native names because they work on both surfaces (the AI Logs blade transparently accepts them too). If you’re already comfortable with the classic schema, swap each App* for the lowercase form and the Properties[...] / Name / TimeGenerated columns for customDimensions[...] / name / timestamp.

Useful queries

Find anything we just emitted — handy first thing after deploy. Scope search to the App-Insights-derived tables; an unscoped search scans every workspace table (including infra tables like Heartbeat, Perf, AzureActivity) which don’t share the Properties column and will fail the projection.

search in (AppTraces, AppDependencies, AppEvents, AppRequests, AppExceptions) "ef.slow_command"
| where TimeGenerated > ago(1h)
| summarize count() by $table

Once you know which table is carrying the events, swap to a table-specific query (the EF-targeted ones below).

If formation.efcommand.duration rows are arriving but no ef.slow_command events have fired, nothing’s been slow enough yet under the default 500 ms threshold. To smoke-test the slow path drop the threshold via Telemetry:SlowQueryThresholdMs and redeploy.

AppMetrics
| where Name == "formation.efcommand.duration"
| where TimeGenerated > ago(1h)
| count

EF command duration by command type — averages and max from the pre-aggregated metrics. For accurate p95/p99 use the App Insights Metrics blade (chart mode, then change aggregation to p95/p99) — AppMetrics rows are pre-aggregated by interval so KQL percentile over them is approximate.

AppMetrics
| where Name == "formation.efcommand.duration"
| where TimeGenerated > ago(1h)
| extend command_type = tostring(Properties.command_type),
         success     = tobool(Properties.success)
| summarize
    avgMs = avg(Sum / ItemCount),
    maxMs = max(Max),
    count = sum(ItemCount)
    by command_type, success
| order by avgMs desc

Slow EF commands with SQL text — the ef.slow_command span event is exported as a trace record by the Azure Monitor exporter. The Properties bag carries db.statement, db.duration_ms, and db.command_type.

AppTraces
| where TimeGenerated > ago(1h)
| where Message has "ef.slow_command"
   or Properties has "ef.slow_command"
| extend statement   = tostring(Properties["db.statement"]),
         durationMs  = todouble(Properties["db.duration_ms"]),
         commandType = tostring(Properties["db.command_type"])
| project TimeGenerated, durationMs, commandType, statement, OperationId
| order by durationMs desc

The OperationId joins back to the parent request — chain it with AppRequests | where OperationId == "..." to see which endpoint the slow SQL came from.

Slow EF commands logged via ILogger — the fallback path when no Activity is current (background work):

AppTraces
| where TimeGenerated > ago(1h)
| where Message startswith "Slow EF command"
| project TimeGenerated, Message, SeverityLevel

Top 10 slowest command handlers by avg duration:

AppMetrics
| where Name == "formation.handler.duration"
| where TimeGenerated > ago(1h)
| extend command = tostring(Properties.command)
| summarize avgMs = avg(Sum / ItemCount), maxMs = max(Max), n = sum(ItemCount) by command
| order by avgMs desc
| take 10

HTTP route latency — already auto-instrumented by Azure.Monitor.OpenTelemetry.AspNetCore, no custom code needed. Prefer Properties["http.route"] (the route template, e.g. /api/Schemes/{key}) and fall back to Name when the property isn’t present:

AppRequests
| where TimeGenerated > ago(1h)
| where Success == true
| extend label = coalesce(tostring(Properties["http.route"]), Name)
| summarize p50=percentile(DurationMs, 50), p95=percentile(DurationMs, 95), p99=percentile(DurationMs, 99), count() by label
| order by p95 desc
| take 20

Drill from a slow request to its full SQL fan-out — the highest-leverage single query for triage. Pick an OperationId from one of the queries above (slow request, slow EF command, anything tied to a span), and pull every row the request produced:

let opId = "<OperationId>";
union withsource=SourceTable AppRequests, AppDependencies, AppTraces, AppExceptions
| where OperationId == opId
| project TimeGenerated, SourceTable, Name, Message, DurationMs, Success
| order by TimeGenerated asc

This turns “this query is slow in isolation” into “this route is slow because it issues this query and N others” — usually the actionable fix.

KQL pitfalls

A few things that bit us during real triage and aren’t obvious from the table schemas:

sql is a reserved word in KQL. extend sql = ... parses as a syntax error. Use a different identifier (statement, cmd).
In a union of tables, you need withsource=<Col> to capture the source table name. Bare $table doesn’t resolve in a project, and itemType (a column in the classic Application Insights schema) doesn’t exist on the workspace-native tables. Always union withsource=SourceTable A, B, C | project SourceTable, …. Avoid the obvious name Source — AppDependencies already has a column called that and the union will be rejected with “column named ‘Source’ already exists”.
Unscoped search fans out across every workspace table — infra tables like Heartbeat / Perf / AzureActivity don’t share the Properties column and will fail any subsequent projection that references it. Always scope: search in (AppTraces, AppDependencies, AppEvents, AppRequests, AppExceptions) "<term>".
AppMetrics rows are pre-aggregated by interval. KQL percentile(Sum / ItemCount, 95) over them is approximate. For exact percentiles, use the App Insights Metrics blade chart UI with the percentile aggregation.

Local development

Telemetry:ConsoleMetrics = true in appsettings.Development.json prints OTel histograms to the console at the configured export interval. Useful when iterating on a hot path without leaving the machine.

What’s not (yet) instrumented

Per-OData-controller spans. The HTTP request span (auto-instrumented by Azure.Monitor.OpenTelemetry.AspNetCore) covers the route as a whole. The EF command spans break the SQL portion out. The C# post-processing time inside the controller (e.g. LoadPolymorphicCollectionsAsync) is currently the residual after subtracting EF time from the request span.
IQueryMediator handlers. LiteBus is wired up to discover them, but the API doesn’t currently inject IQueryMediator anywhere — every read flows through OData controllers + EF directly. If query handlers ever start being used, mirror InstrumentedCommandMediator to wrap them.
Frontend RUM / Web Vitals. The SvelteKit app has no client-side OTel SDK initialised today.
Container CPU/memory dashboards or alerts. The data lands in Log Analytics from ACA defaults, but no workbook or alert rule is provisioned.