Observability
Overview
Section titled “Overview”The API ships traces, metrics, and logs to Azure Monitor (Application
Insights + Log Analytics) in non-dev environments via
UseAzureMonitor() in Program.cs.
In Development, OTel is wired up but the Azure exporter is off (set
Telemetry:ConsoleMetrics to true to dump metrics to stdout instead).
This page lists the custom signals — the bits beyond what
Azure.Monitor.OpenTelemetry.AspNetCore gives you for free — and how
to query them.
Custom signals
Section titled “Custom signals”All defined in FormationTelemetry.cs.
Activity sources
Section titled “Activity sources”| Source | What it spans |
|---|---|
Formation.Handlers | LiteBus command + event handlers (one span per call) |
Formation.Search | Full-text search operations |
Formation.Browser | Browser-emitted spans propagated up via headers |
Histograms (meter Formation.API)
Section titled “Histograms (meter Formation.API)”| Metric | Unit | Tags | Source |
|---|---|---|---|
formation.handler.duration | ms | command | InstrumentedCommandMediator |
formation.event.duration | ms | event | InstrumentedEventMediator |
formation.search.duration | ms | entity | FormationTelemetry.SearchScope |
formation.efcommand.duration | ms | command_type, success | SlowQueryLoggingInterceptor |
Span events
Section titled “Span events”| Event | Where | Attributes |
|---|---|---|
ef.slow_command | Current Activity (any) | db.command_type, db.duration_ms, db.success, db.statement (4 KB max), optional db.param.* |
The ef.slow_command event fires whenever an EF Core DbCommand
exceeds Telemetry:SlowQueryThresholdMs (default 500). It hangs off
whatever Activity is current at the moment EF runs the command — so
you’ll see it nested inside the controller / command handler / event
handler that triggered it. If no Activity is current (e.g. background
work), the slow command is logged via ILogger instead, which still
ships to Log Analytics.
Runtime metrics (OpenTelemetry.Instrumentation.Runtime)
Section titled “Runtime metrics (OpenTelemetry.Instrumentation.Runtime)”Wired up via .AddRuntimeInstrumentation() in Program.cs.
Publishes the standard .NET runtime counters under dotnet.* —
note these were called process.runtime.dotnet.* in earlier
versions of the package; the 1.15.x line uses the shorter form
that ships in customMetrics. The most useful ones for chasing
latency stalls:
| Metric | Tags / dimensions | What it tells you |
|---|---|---|
dotnet.gc.collections | gc.heap.generation | Per-generation collection rate (gen0 / gen1 / gen2). Spikes in gen2 line up with heap-pressure stalls. |
dotnet.gc.last_collection.heap.size | gc.heap.generation | Heap size per generation, sampled. Sudden drop = a Gen 2 collection just happened. |
dotnet.gc.pause.time | — | Cumulative seconds the runtime spent paused in GC. The sharp number for “is GC the bottleneck”. |
dotnet.gc.heap.total_allocated | — | Lifetime allocation total (bytes). Rate-of-change under load is the allocation-pressure signal. |
dotnet.thread_pool.thread.count | — | Worker-thread pool size. Slow climb under load = pool starvation symptoms. |
dotnet.thread_pool.work_item.count | — | Queue depth in the thread pool. High here under load is the smoking gun for sync-over-async. |
dotnet.monitor.lock_contentions | — | Contended lock / Monitor acquisitions. Spikes correlate with serialised hot paths. |
dotnet.jit.compilation.time | — | Cumulative JIT time. Useful to rule in/out cold-start / tiered-recompilation stalls. |
We added these after the detail-page load test surfaced lockstep
latency stalls across all six list routes that didn’t show up in any
SQL or Kestrel metric — see
app.loadtests/BASELINE.md
for the investigation. Without runtime metrics the GC hypothesis was
unprovable.
Configuration
Section titled “Configuration”Both keys live under Telemetry: in appsettings.json. Override per
environment via the standard ASP.NET configuration chain.
{ "Telemetry": { "SlowQueryThresholdMs": 500, "LogSlowQueryParameters": false }}SlowQueryThresholdMs— int, ms. Below this threshold the command is recorded in the histogram only (cheap). Above, you also get a span event with the SQL.LogSlowQueryParameters— bool. Off by default. Flip on locally when chasing a specific slow query and you need parameter values to reproduce. Don’t enable in prod without thinking about parameter-value sensitivity.
Where to find the data
Section titled “Where to find the data”Where to query
Section titled “Where to query”The same data is reachable from two surfaces, with different table names:
- App Insights resource → Logs blade uses the classic schema:
customMetrics,traces,dependencies,customEvents,requests,exceptions. Columns:name,timestamp,customDimensions. - Log Analytics workspace → Logs blade uses the workspace-native
schema:
AppMetrics,AppTraces,AppDependencies,AppEvents,AppRequests,AppExceptions. Columns:Name,TimeGenerated,Properties.
The queries below use the workspace-native names because they
work on both surfaces (the AI Logs blade transparently accepts them
too). If you’re already comfortable with the classic schema, swap
each App* for the lowercase form and the Properties[...] /
Name / TimeGenerated columns for customDimensions[...] /
name / timestamp.
Useful queries
Section titled “Useful queries”Find anything we just emitted — handy first thing after deploy.
Scope search to the App-Insights-derived tables; an unscoped
search scans every workspace table (including infra tables like
Heartbeat, Perf, AzureActivity) which don’t share the
Properties column and will fail the projection.
search in (AppTraces, AppDependencies, AppEvents, AppRequests, AppExceptions) "ef.slow_command"| where TimeGenerated > ago(1h)| summarize count() by $tableOnce you know which table is carrying the events, swap to a table-specific query (the EF-targeted ones below).
If formation.efcommand.duration rows are arriving but no
ef.slow_command events have fired, nothing’s been slow enough yet
under the default 500 ms threshold. To smoke-test the slow path drop
the threshold via Telemetry:SlowQueryThresholdMs and redeploy.
AppMetrics| where Name == "formation.efcommand.duration"| where TimeGenerated > ago(1h)| countEF command duration by command type — averages and max from the
pre-aggregated metrics. For accurate p95/p99 use the App Insights
Metrics blade (chart mode, then change aggregation to p95/p99) —
AppMetrics rows are pre-aggregated by interval so KQL percentile
over them is approximate.
AppMetrics| where Name == "formation.efcommand.duration"| where TimeGenerated > ago(1h)| extend command_type = tostring(Properties.command_type), success = tobool(Properties.success)| summarize avgMs = avg(Sum / ItemCount), maxMs = max(Max), count = sum(ItemCount) by command_type, success| order by avgMs descSlow EF commands with SQL text — the ef.slow_command span event
is exported as a trace record by the Azure Monitor exporter. The
Properties bag carries db.statement, db.duration_ms, and
db.command_type.
AppTraces| where TimeGenerated > ago(1h)| where Message has "ef.slow_command" or Properties has "ef.slow_command"| extend statement = tostring(Properties["db.statement"]), durationMs = todouble(Properties["db.duration_ms"]), commandType = tostring(Properties["db.command_type"])| project TimeGenerated, durationMs, commandType, statement, OperationId| order by durationMs descThe OperationId joins back to the parent request — chain it with
AppRequests | where OperationId == "..." to see which endpoint the
slow SQL came from.
Slow EF commands logged via ILogger — the fallback path when no Activity is current (background work):
AppTraces| where TimeGenerated > ago(1h)| where Message startswith "Slow EF command"| project TimeGenerated, Message, SeverityLevelTop 10 slowest command handlers by avg duration:
AppMetrics| where Name == "formation.handler.duration"| where TimeGenerated > ago(1h)| extend command = tostring(Properties.command)| summarize avgMs = avg(Sum / ItemCount), maxMs = max(Max), n = sum(ItemCount) by command| order by avgMs desc| take 10HTTP route latency — already auto-instrumented by
Azure.Monitor.OpenTelemetry.AspNetCore, no custom code needed.
Prefer Properties["http.route"] (the route template, e.g.
/api/Schemes/{key}) and fall back to Name when the
property isn’t present:
AppRequests| where TimeGenerated > ago(1h)| where Success == true| extend label = coalesce(tostring(Properties["http.route"]), Name)| summarize p50=percentile(DurationMs, 50), p95=percentile(DurationMs, 95), p99=percentile(DurationMs, 99), count() by label| order by p95 desc| take 20Drill from a slow request to its full SQL fan-out — the highest-leverage
single query for triage. Pick an OperationId from one of the queries above
(slow request, slow EF command, anything tied to a span), and pull every row
the request produced:
let opId = "<OperationId>";union withsource=SourceTable AppRequests, AppDependencies, AppTraces, AppExceptions| where OperationId == opId| project TimeGenerated, SourceTable, Name, Message, DurationMs, Success| order by TimeGenerated ascThis turns “this query is slow in isolation” into “this route is slow because it issues this query and N others” — usually the actionable fix.
KQL pitfalls
Section titled “KQL pitfalls”A few things that bit us during real triage and aren’t obvious from the table schemas:
sqlis a reserved word in KQL.extend sql = ...parses as a syntax error. Use a different identifier (statement,cmd).- In a
unionof tables, you needwithsource=<Col>to capture the source table name. Bare$tabledoesn’t resolve in a project, anditemType(a column in the classic Application Insights schema) doesn’t exist on the workspace-native tables. Alwaysunion withsource=SourceTable A, B, C | project SourceTable, …. Avoid the obvious nameSource—AppDependenciesalready has a column called that and the union will be rejected with “column named ‘Source’ already exists”. - Unscoped
searchfans out across every workspace table — infra tables likeHeartbeat/Perf/AzureActivitydon’t share thePropertiescolumn and will fail any subsequent projection that references it. Always scope:search in (AppTraces, AppDependencies, AppEvents, AppRequests, AppExceptions) "<term>". AppMetricsrows are pre-aggregated by interval. KQLpercentile(Sum / ItemCount, 95)over them is approximate. For exact percentiles, use the App Insights Metrics blade chart UI with the percentile aggregation.
Local development
Section titled “Local development”Telemetry:ConsoleMetrics = true in appsettings.Development.json
prints OTel histograms to the console at the configured export
interval. Useful when iterating on a hot path without leaving the
machine.
What’s not (yet) instrumented
Section titled “What’s not (yet) instrumented”- Per-OData-controller spans. The HTTP request span (auto-instrumented by
Azure.Monitor.OpenTelemetry.AspNetCore) covers the route as a whole. The EF command spans break the SQL portion out. The C# post-processing time inside the controller (e.g.LoadPolymorphicCollectionsAsync) is currently the residual after subtracting EF time from the request span. IQueryMediatorhandlers. LiteBus is wired up to discover them, but the API doesn’t currently injectIQueryMediatoranywhere — every read flows through OData controllers + EF directly. If query handlers ever start being used, mirrorInstrumentedCommandMediatorto wrap them.- Frontend RUM / Web Vitals. The SvelteKit app has no client-side OTel SDK initialised today.
- Container CPU/memory dashboards or alerts. The data lands in Log Analytics from ACA defaults, but no workbook or alert rule is provisioned.