Background Jobs

Formation runs five Container App Jobs alongside the long-lived API and web apps. Each has its own Dockerfile, its own managed identity, and its own trigger model (event / schedule / manual). They share the same SQL database and Key Vault as the API, but none of them accept HTTP traffic — jobs are scheduled or enqueued, never called directly.

This page documents each job: what it does, what triggers it, where its source lives, and the external dependencies it owns. For the topology / scaling / identity detail, see Deployment Topology → Jobs.

| Job | Container App | Source | Trigger | Typical frequency |
| --- | --- | --- | --- | --- |
| Data Load | ca-jobload | load/ | Azure Storage Queue (KEDA scaler) | Event-driven; one execution per import file |
| Completeness Score | ca-jobcompscore | completionscore/ | Manual / scheduled | Nightly |
| Query View Rebuild | ca-jobqueryvws | rebuildqueryviews/ | Manual / scheduled | On demand after schema / mapper changes |
| Currency Import | ca-jobcurimp | currencyimport/ | Scheduled (daily) | Daily |
| Duplicate Detection | ca-jobdedup | duplicatedetection/ | Manual | Roughly weekly |

All jobs share the pattern described in Shared Patterns below: IHostedService worker, IJobProgressService for tracking, env-var-driven trigger metadata, graceful cancellation.

Data Load

Purpose. Ingest bulk data from files dropped into the data-load blob container. A message on the data-load storage queue points the job at a file; the worker parses it, dispatches the appropriate LiteBus commands, and marks the job complete.

Trigger. KEDA Azure Storage Queue scaler. When a message arrives on the queue, a replica spins up, processes it, then scales back to zero.

Flow.

[blob drop] → [queue message] → jobload replica starts
    parse file (CSV / XLSX)
    dispatch CreateXxx / UpdateXxx commands via LiteBus
    write JobExecution progress rows
    acknowledge / delete queue message

Source and deployment. src/services/job/load/. Dockerised, published via dotnet-service-deploy.yml, deployed to ca-jobload Container App Job. Scales to zero between messages.

External dependencies. Azure Storage (queue + blob), SQL (writes through the same API command pipeline), Key Vault (SQL connection string).

Notes. Because load commands go through LiteBus, every write triggers the same event fan-out as an HTTP write — including query-view upserts. A bulk import of 10,000 scheme rows generates ~70,000 query-view upserts. For very large imports, invoke query view rebuild after the load so the view is rebuilt once in batch instead of row-at-a-time.

Completeness Score

Purpose. Compute a per-entity “completeness” score that measures how fully populated each row is across its expected fields. Writes CompletenessScore directly to the [query].*List tables (bypassing the normal event-handler path for performance).

Trigger. Manual (via the API’s Container Apps Jobs Operator role) or scheduled — typically a nightly run.

Flow.

scan [app].{Address,Company,Scheme,...} → for each entity:
    read required/optional field set from metadata
    compute score (weighted, per-entity-type algorithm)
    update [query].{Address,Company,Scheme}List.CompletenessScore
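
As a rough illustration of the "weighted" scoring step in the flow above, the sketch below computes a completeness percentage from a set of weighted fields. It is written in Python for brevity, not in the job's own language. The field names, the weights, and the `completeness_score` function are all hypothetical; the real per-entity rules live in CompletenessCalculator.

```python
from typing import Any, Mapping


def is_populated(value: Any) -> bool:
    """Treat None and blank strings as missing; everything else counts."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    return True


def completeness_score(row: Mapping[str, Any], weights: Mapping[str, float]) -> float:
    """Return 0-100: the weight of populated fields over the total weight."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    filled = sum(w for field, w in weights.items() if is_populated(row.get(field)))
    return round(100.0 * filled / total, 1)


# Hypothetical weights: fields treated as required count double.
COMPANY_WEIGHTS = {"name": 2.0, "country_id": 2.0, "company_number": 1.0, "website": 1.0}

company = {"name": "Foo Ltd", "country_id": 44, "company_number": "", "website": None}
print(completeness_score(company, COMPANY_WEIGHTS))  # 66.7
```

Keeping the weights in data rather than code is one way to let each entity type carry its own schedule; whether the real calculator does this is not stated here.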

Source and deployment. src/services/job/completionscore/. Deployed to ca-jobcompscore. 4-hour replica timeout.

External dependencies. SQL (read + write), Key Vault (connection string), managed identity mi-jobcompscore-01.

Notes. The scores are not recomputed on every entity write, because that would make every save slower. Instead, the nightly job recomputes the whole set. Scores for newly created rows show up the next morning; after a bulk import, consider triggering the job explicitly rather than waiting for the schedule.

The score calculation logic is documented alongside the job source; see CompletenessCalculator for the per-entity rules.

Query View Rebuild

Purpose. Rebuild [query].*List tables from their [app].* sources. Covers every denormalised table: AddressList, CompanyList, SchemeList, InvestmentEventList, OccupierEventList, PortfolioList.

Trigger. Manual or scheduled (irregular — typically after deployments that change the denormalisation shape, bulk imports, or recovery from event-handler failures).

Flow.

for each Query View Service:
    truncate the [query] table
    page through the [app] source in batches (default 1000 rows)
    load with relations via Mapper.IncludeRelations
    project to DTO via Mapper.MapToListItem
    populate aggregate counts with one GROUP BY per aggregate
    bulk insert into [query]
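
The truncate, page, aggregate, and bulk-insert loop above can be sketched with an in-memory SQLite database. This is only an illustration of the shape, in Python rather than the job's own stack: the table names (`app_company`, `app_scheme`, `query_company_list`), the `scheme_count` aggregate, and the use of keyset paging are assumptions made for the example.

```python
import sqlite3

BATCH_SIZE = 1000  # matches the default batch size quoted above


def rebuild_company_list(conn: sqlite3.Connection, batch_size: int = BATCH_SIZE) -> int:
    """Empty the read model, then refill it from the source table page by page."""
    conn.execute("DELETE FROM query_company_list")  # SQLite has no TRUNCATE
    last_id, written = 0, 0
    while True:
        # Keyset paging: resume after the last id seen instead of using OFFSET.
        rows = conn.execute(
            "SELECT id, name FROM app_company WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        page_end = rows[-1][0]
        # One GROUP BY for the whole page rather than one count query per row.
        counts = dict(conn.execute(
            "SELECT company_id, COUNT(*) FROM app_scheme "
            "WHERE company_id > ? AND company_id <= ? GROUP BY company_id",
            (last_id, page_end),
        ))
        conn.executemany(
            "INSERT INTO query_company_list (id, name, scheme_count) VALUES (?, ?, ?)",
            [(cid, name, counts.get(cid, 0)) for cid, name in rows],
        )
        last_id = page_end
        written += len(rows)
    conn.commit()
    return written


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE app_scheme (id INTEGER PRIMARY KEY, company_id INTEGER);
    CREATE TABLE query_company_list (id INTEGER PRIMARY KEY, name TEXT, scheme_count INTEGER);
    INSERT INTO app_company VALUES (1, 'Acme'), (2, 'Bravo'), (3, 'Corvid');
    INSERT INTO app_scheme VALUES (10, 1), (11, 1), (12, 3);
""")
rebuilt = rebuild_company_list(conn, batch_size=2)  # small batch to force two pages
```

Running the function twice gives the same result, which is the property that makes a full rebuild safe to use as a recovery tool.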

Source and deployment. src/services/job/rebuildqueryviews/. Deployed to ca-jobqueryvws. 4-hour replica timeout.

External dependencies. SQL only. Uses FromSqlRaw("SELECT ... WITH (NOLOCK)") on the read path to avoid blocking ongoing writes — query-view consistency is eventual, so a near-current snapshot is acceptable during rebuild.

Notes. Deeply covered in Query Views, which explains the mapper ↔ service layering, the column list ↔ FTI ↔ mapper four-site edit requirement, and when you’d actually run this vs letting event handlers maintain the rows incrementally. A full production rebuild completes in 20–40 minutes; per-entity rebuilds (e.g. only CompanyList) scale down accordingly.

Currency Import

Purpose. Pull exchange-rate data from the BI lakehouse (ECB-sourced currency conversion tables) and load it into Formation’s CurrencyConversion table so the API can report property values in EUR / GBP / USD regardless of the currency they were originally captured in.

Trigger. Scheduled (daily) or manual.

Source of truth. The lakehouse — a separate Microsoft SQL-accessible store — exposes a [ECBExchangeRates].[CurrencyConversion] view with raw ECB feed data. The connection string lives in Key Vault as warehouse-db-connection-string and is consumed via ConnectionStrings__WarehouseDb.

Flow.

query lakehouse:
    SELECT CurrencyCode, ConversionCurrencyCode, ConversionRate,
           EffectiveDate, CurrencyName, CountryCode, CountryName
    FROM [ECBExchangeRates].[CurrencyConversion]
pivot:
    ECB publishes one row per (currency pair, date), all against EUR
    Group by (CurrencyCode, EffectiveDate)
    Prefer direct X→GBP / X→USD pairs where the lakehouse has them
    Otherwise compute cross-rates via EUR:
        X→GBP = (X→EUR) / (GBP→EUR)
        X→USD = (X→EUR) / (USD→EUR)
write Formation:
    upsert into [app].[CurrencyConversion] in 500-row batches
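
To make the pivot step concrete, here is a minimal Python sketch of the "prefer direct pair, otherwise cross-rate via EUR" rule. It is not the code in SqlWarehouseDataReader.cs. The tuple layout, the `pivot_rates` name, and the reading of a rate as "units of the conversion currency per one unit of the currency" are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Optional, Tuple

# (currency, conversion_currency, rate, effective_date). The rate is assumed to
# mean "units of conversion_currency per 1 unit of currency".
RateRow = Tuple[str, str, Optional[float], date]
TARGETS = ("EUR", "GBP", "USD")


def pivot_rates(rows: Iterable[RateRow]) -> Dict[Tuple[str, date], Dict[str, float]]:
    """Group by (currency, date) and produce one rate per target currency."""
    by_date: Dict[date, Dict[Tuple[str, str], float]] = defaultdict(dict)
    for cur, conv, rate, day in rows:
        if rate is not None:  # rows without a quote are ignored
            by_date[day][(cur, conv)] = rate

    out: Dict[Tuple[str, date], Dict[str, float]] = {}
    for day, pairs in by_date.items():
        for cur in {c for c, _ in pairs}:
            result: Dict[str, float] = {}
            for target in TARGETS:
                if cur == target:
                    result[target] = 1.0
                elif (cur, target) in pairs:  # direct pair wins
                    result[target] = pairs[(cur, target)]
                elif (cur, "EUR") in pairs and (target, "EUR") in pairs:
                    # Cross-rate: X→T = (X→EUR) / (T→EUR)
                    result[target] = pairs[(cur, "EUR")] / pairs[(target, "EUR")]
            out[(cur, day)] = result
    return out


day = date(2024, 1, 2)
sample = [
    ("SEK", "EUR", 0.09, day),
    ("SEK", "USD", 0.1005, day),  # direct pair, used as-is
    ("GBP", "EUR", 1.2, day),
    ("USD", "EUR", 0.9, day),
    ("CHF", "EUR", None, day),    # no quote on this date, skipped
]
pivoted = pivot_rates(sample)
```

In this sample, SEK→GBP has no direct pair and is derived as 0.09 / 1.2, while SEK→USD uses the direct quote even though a cross-rate could also be computed.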

Source and deployment. src/services/job/currencyimport/. SqlWarehouseDataReader.cs owns the lakehouse read + pivot; the CurrencyImportWorker drives the overall orchestration. Deployed to ca-jobcurimp.

External dependencies. Lakehouse (read-only), Formation SQL (write), Key Vault (holds both connection strings).

Notes.

  • The lakehouse connection string is optional at deploy time: warehouseDbConnectionString in Bicep is annotated @secure() and only written to Key Vault on first deploy or explicit rotation. This prevents routine redeploys from overwriting a stable external-system credential with an empty default. If the connection string is not configured, the job fails fast on startup with a clear “WarehouseDb connection string is not configured” error rather than silently doing nothing.
  • Only rows with a non-null ConversionRate are considered; ECB occasionally publishes empty rows for currencies that have no quote on a given date.
  • The pivot logic tolerates either direct-pair or cross-rate data. If the lakehouse changes format (e.g. adds a direct GBP pair for a currency that was previously cross-rated), the output is unchanged.

Duplicate Detection

Purpose. Identify likely-duplicate entities (two address rows that are the same physical address, two company rows that are the same legal entity, two schemes that are the same development) and flag them for human review in the /duplicates page.

Trigger. Manual. Scheduling is possible, but the job is currently run by operators when the backlog of potential duplicates is large enough to warrant a pass.

Source and deployment. src/services/job/duplicatedetection/. The worker lives here; the blocking and scoring strategies live under src/common/services/DuplicateDetection/ so they can be unit-tested and reused. Deployed to ca-jobdedup. External dependencies: SQL only.

for each strategy (Address, Company, Scheme):
    1. GetBlocksAsync: group entities into blocks by one or more keys
    2. Pairwise compare: inside each block, every pair runs CalculateSimilarity
    3. Threshold filter: pairs scoring ≥ strategy.Threshold are flagged
    4. Idempotent persist: insert new pairs, update existing unresolved,
       skip dismissed pairs, never resurface resolved
    write flagged pairs to [app].[DuplicateEntity] with DuplicateEntityStatusId=1 (Pending)

The orchestrator is DuplicateDetectionService.ProcessAsync; the worker runs each strategy through it in turn.

A naïve comparison of every row against every other row is O(n²). For 500,000 addresses that’s 125 billion pairwise calls — never going to finish. Blocking narrows the candidate space: group records into sets where duplicates are plausible, and only compare within each set.
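
The arithmetic behind that figure, and the effect of blocking on it, can be checked in a few lines of Python. The even split into 2,500 blocks of 200 records is a hypothetical best case chosen for the example, not a measurement.

```python
from math import comb
from typing import Iterable


def naive_pairs(n: int) -> int:
    """Number of unordered pairs when every row is compared with every other."""
    return comb(n, 2)


def blocked_pairs(block_sizes: Iterable[int]) -> int:
    """Number of comparisons when pairs are only formed inside each block."""
    return sum(comb(size, 2) for size in block_sizes)


print(naive_pairs(500_000))          # 124999750000, i.e. roughly 125 billion
print(blocked_pairs([200] * 2_500))  # 49750000
```

Under that assumption, blocking cuts the work by a factor of about 2,500, which is what turns the run from infeasible into routine.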

A good blocking key has two properties:

  1. Recall — genuine duplicates land in the same block.
  2. Selectivity — blocks stay small enough that within-block comparison is cheap.

There’s tension between the two. A very wide key (e.g. “first letter of name”) gives near-perfect recall but huge blocks; a very narrow key (e.g. “exact matching email”) gives tiny blocks but misses variants.

DuplicateDetectionService caps each block at 200 items. Anything larger is dropped whole (logged as a warning, no pairs from it are compared) so that a single too-wide block can’t stall the entire run with an O(n²) explosion on 200,000 items.

The cap means a single blocking key is fragile: if a common key (e.g. Paris postcode 75008 with its thousands of addresses) exceeds the cap, every pair inside is lost. The mitigation is:

Each strategy emits the same record into several blocks with different keys. A given pair that appears in more than one block is compared exactly once — the orchestrator tracks processedPairs across blocks — so wider emission costs nothing in duplicated work. If one block overflows and is dropped, another narrower block still catches the pair.

The pattern, for every strategy, is: one wide/primary block for general recall, plus one or more narrower sub-keys that will survive the cap even when the primary doesn’t.
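
The interplay of the block cap, multi-key emission, and cross-block pair tracking can be sketched as follows. This is a simplified Python illustration, not DuplicateDetectionService itself: `find_candidates`, its parameters, and the toy records are hypothetical, and the cap is lowered to 2 in the demo so the overflow is visible.

```python
from itertools import combinations
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

MAX_BLOCK_SIZE = 200  # the cap quoted above


def find_candidates(
    records: Dict[int, dict],
    block_keys: Callable[[dict], Iterable[Hashable]],
    similarity: Callable[[dict, dict], float],
    threshold: float,
    max_block_size: int = MAX_BLOCK_SIZE,
) -> List[Tuple[int, int, float]]:
    """Return (smaller_id, larger_id, score) for pairs at or above the threshold."""
    blocks: Dict[Hashable, Set[int]] = {}
    for rec_id, rec in records.items():
        for key in block_keys(rec):  # one record may land in several blocks
            blocks.setdefault(key, set()).add(rec_id)

    processed: Set[Tuple[int, int]] = set()
    flagged: List[Tuple[int, int, float]] = []
    for ids in blocks.values():
        if len(ids) > max_block_size:
            continue  # oversized block is dropped whole
        for pair in combinations(sorted(ids), 2):  # smaller id first
            if pair in processed:
                continue  # already compared via another block
            processed.add(pair)
            score = similarity(records[pair[0]], records[pair[1]])
            if score >= threshold:
                flagged.append((pair[0], pair[1], score))
    return flagged


records = {
    1: {"postcode": "75008", "line": "1 rue a"},
    2: {"postcode": "75008", "line": "1 rue a"},
    3: {"postcode": "75008", "line": "9 rue z"},
}
calls: List[int] = []


def keys(rec: dict) -> List[str]:
    return [f"pc:{rec['postcode']}", f"line:{rec['line']}"]


def similarity(a: dict, b: dict) -> float:
    calls.append(1)  # count comparisons for the demo
    return 100.0 if a["line"] == b["line"] else 0.0


hits = find_candidates(records, keys, similarity, threshold=75.0, max_block_size=2)
```

With the cap at 2, the three-record postcode block is dropped, yet the pair (1, 2) is still found through the narrower address-line block with a single comparison.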

Address (AddressDuplicateStrategy)

| Key | When emitted | Typical block size |
| --- | --- | --- |
| "{country}\|pc:{normalisedPostcode}" | Postcode present | Huge for busy postcodes (Paris 75008 → thousands); often dropped |
| "{country}\|line:{normalisedAddressLine}" | AddressLine present | Tiny (2-3 rows for genuine duplicates) |
| "{country}\|loc:{locality}" | Postcode AND AddressLine missing | Wide; pure fallback for sparse imports |

Postcode normalisation strips spaces and uppercases ("sw1a 1aa" → "SW1A1AA"). AddressLine normalisation lowercases, trims, and collapses whitespace; it is deliberately conservative (no punctuation stripping) because blocking is a recall filter, not a similarity scorer.
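
A small Python sketch of those two normalisers and the resulting address keys follows. The function names are hypothetical, and lowercasing the locality in the fallback key is an assumption, since the text does not say how locality is normalised.

```python
import re
from typing import List, Optional


def normalise_postcode(postcode: str) -> str:
    """Remove all whitespace and uppercase."""
    return re.sub(r"\s+", "", postcode).upper()


def normalise_address_line(line: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace; punctuation is kept."""
    return re.sub(r"\s+", " ", line.strip()).lower()


def address_block_keys(
    country: str,
    postcode: Optional[str],
    address_line: Optional[str],
    locality: Optional[str],
) -> List[str]:
    """Emit one key per available field; locality only when both others are missing."""
    keys: List[str] = []
    if postcode and postcode.strip():
        keys.append(f"{country}|pc:{normalise_postcode(postcode)}")
    if address_line and address_line.strip():
        keys.append(f"{country}|line:{normalise_address_line(address_line)}")
    if not keys and locality and locality.strip():
        # Assumption: locality is trimmed and lowercased for the fallback key.
        keys.append(f"{country}|loc:{locality.strip().lower()}")
    return keys


print(address_block_keys("GB", "sw1a 1aa", "  10  Downing   Street ", "London"))
# ['GB|pc:SW1A1AA', 'GB|line:10 downing street']
```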

Company (CompanyDuplicateStrategy)

| Key | When emitted | Typical block size |
| --- | --- | --- |
| "prefix:{normalised[..3]}" | Normalised name ≥ 3 chars | Wide; common prefixes ("the", "abc") can overflow |
| "name:{normalisedFullName}" | Any normalised name | Small; catches "Foo Ltd" vs "Foo Limited" via suffix stripping |
| "number:{companyNumber.ToUpperInvariant()}" | Company number present | Small; catches rebrands with same registration number, different display name |

Name normalisation (StringSimilarity.NormaliseCompanyName) lowercases, strips punctuation, and removes common legal-entity suffixes (ltd, limited, plc, llc, corp, gmbh, sa, …).
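
The following Python sketch shows one way the normalisation and the three company keys could fit together. It is not StringSimilarity.NormaliseCompanyName: the suffix set is only the partial list quoted above, and stripping suffixes from the end of the name only is an assumption.

```python
import string
from typing import List, Optional

# Partial list taken from the text above; the real list is longer.
LEGAL_SUFFIXES = {"ltd", "limited", "plc", "llc", "corp", "gmbh", "sa"}
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalise_company_name(name: str) -> str:
    """Lowercase, drop punctuation, then strip trailing legal-entity suffixes."""
    tokens = name.lower().translate(_PUNCTUATION).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def company_block_keys(name: Optional[str], company_number: Optional[str]) -> List[str]:
    """Emit prefix, full-name, and registration-number keys where available."""
    keys: List[str] = []
    normalised = normalise_company_name(name) if name else ""
    if len(normalised) >= 3:
        keys.append(f"prefix:{normalised[:3]}")
    if normalised:
        keys.append(f"name:{normalised}")
    if company_number:
        keys.append(f"number:{company_number.upper()}")
    return keys


print(company_block_keys("Foo Ltd.", "ab123"))
# ['prefix:foo', 'name:foo', 'number:AB123']
```

Because "Foo Ltd" and "Foo Limited" both normalise to "foo", they share the name: key and end up in the same small block.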

Scheme (SchemeDuplicateStrategy)

| Key | When emitted | Typical block size |
| --- | --- | --- |
| "addr:{addressId}" | Always | Usually small (1-5 schemes per address); occasionally overflows for over-merged addresses |
| "addr:{addressId}\|name:{normalisedSchemeName}\|type:{buildingTypeId}" | SchemeName present | Tiny (1-3 rows); sub-bucket that survives when the parent overflows. Address-scoped so "Retail Park" at different addresses doesn’t collide; type-scoped so the Office and Retail components of a mixed-use development sharing a name don’t collide either |

Inside each block, every pair runs the strategy’s CalculateSimilarity — a weighted sum of normalised component scores that returns 0-100. Each strategy owns its own weight schedule.

Address (threshold 75.0)

| Component | Weight | Function |
| --- | --- | --- |
| AddressLine | 35% | NormalisedLevenshtein (edit distance ÷ max length, lowercased) |
| Locality | 20% | NormalisedLevenshtein |
| PostalCode | 20% | Exact match after normalisation |
| CountryId | 10% | Exact match |
| Location (geo) | 15% | Haversine distance decay: 1.0 at 0m, linear decay to 0.0 at ≥1000m |
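
As a worked illustration of the address weights, the Python sketch below combines a normalised Levenshtein similarity, exact-match components, and the haversine decay into a 0-100 score. It is a sketch under assumptions, not the strategy's CalculateSimilarity: text similarity is taken to be 1 minus (edit distance ÷ max length), every field is assumed present, and the record layout is hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def text_similarity(a: str, b: str) -> float:
    """Assumed form: 1 - distance / max length, on lowercased input."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres on a sphere of radius 6,371 km."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    h = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6_371_000.0 * asin(sqrt(h))


def geo_similarity(distance_m: float) -> float:
    """1.0 at 0 m, falling linearly to 0.0 at 1000 m and beyond."""
    return max(0.0, 1.0 - distance_m / 1000.0)


WEIGHTS = {"line": 35, "locality": 20, "postcode": 20, "country": 10, "geo": 15}


def address_similarity(a: dict, b: dict) -> float:
    """Weighted sum of component scores, returned on a 0-100 scale."""
    parts = {
        "line": text_similarity(a["line"], b["line"]),
        "locality": text_similarity(a["locality"], b["locality"]),
        "postcode": float(a["postcode"] == b["postcode"]),
        "country": float(a["country_id"] == b["country_id"]),
        "geo": geo_similarity(haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])),
    }
    return sum(WEIGHTS[name] * score for name, score in parts.items())


home = {"line": "10 Downing Street", "locality": "London", "postcode": "SW1A2AA",
        "country_id": 44, "lat": 51.5034, "lon": -0.1276}
typo = dict(home, line="10 Downing Stret")
score = address_similarity(home, typo)  # one edit in a 17-character line
```

A single-character typo in the address line costs only 35 × (1/17), about two points, so the pair stays well above the 75.0 threshold.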

Company (threshold 70.0)

| Component | Weight | Function |
| --- | --- | --- |
| CompanyName | 50% | NormalisedLevenshtein on the suffix-stripped normalised names |
| CompanyNumber | 20% | Exact match (case-insensitive, trimmed) when both present |
| CountryId | 15% | Exact match |
| CompanyTypeId | 15% | Exact match |

Scheme (threshold 75.0)

| Component | Weight | Function |
| --- | --- | --- |
| SchemeName | 45% | NormalisedLevenshtein |
| AddressId | 25% | Exact FK match (always 1.0 within an AddressId block) |
| BuildingTypeId | 20% | Exact match (see hard-gate below) |
| SchemeSizeGross | 10% | Ratio similarity: 1 - \|a-b\|/max(a,b) |

Hard gate on type mismatch. Before the weighted sum runs, CalculateSimilarity short-circuits to 0 when both schemes have a non-null BuildingTypeId and they differ. Mixed-use developments frequently have separate Office, Retail, and Residential schemes sharing the same brand name at the same address — those are legitimately different schemes, not duplicates. If either side’s type is null, the gate doesn’t trigger and the weighted components drive the verdict.
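
The gate and the ratio similarity can be sketched in Python as below. This is illustrative only: the record layout is hypothetical, `difflib.SequenceMatcher` stands in for the NormalisedLevenshtein name comparison to keep the example short, and scoring a missing size as 0.0 is an assumption.

```python
from difflib import SequenceMatcher
from typing import Optional


def ratio_similarity(a: Optional[float], b: Optional[float]) -> float:
    """1 - |a-b| / max(a,b); a missing value scores 0.0 (an assumption)."""
    if a is None or b is None:
        return 0.0
    biggest = max(a, b)
    return 1.0 if biggest == 0 else 1.0 - abs(a - b) / biggest


def name_similarity(a: str, b: str) -> float:
    """Stand-in for NormalisedLevenshtein, using the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def scheme_similarity(a: dict, b: dict) -> float:
    type_a, type_b = a.get("building_type_id"), b.get("building_type_id")
    # Hard gate: two known but different building types can never be duplicates.
    if type_a is not None and type_b is not None and type_a != type_b:
        return 0.0
    return (
        45 * name_similarity(a["name"], b["name"])
        + 25 * float(a["address_id"] == b["address_id"])
        + 20 * float(type_a is not None and type_a == type_b)
        + 10 * ratio_similarity(a.get("size_gross"), b.get("size_gross"))
    )


office = {"name": "Harbour Quarter", "address_id": 7, "building_type_id": 1, "size_gross": 1000}
retail = dict(office, building_type_id=2)
resurvey = dict(office, size_gross=900)
untyped = dict(resurvey, building_type_id=None)

gated = scheme_similarity(office, retail)        # 0.0: the gate fires
same_type = scheme_similarity(office, resurvey)  # 45 + 25 + 20 + 9
one_null = scheme_similarity(office, untyped)    # 45 + 25 + 0 + 9
```

In this sketch, a pair with one unknown type loses the 20 type points but can still clear the 75.0 threshold on name, address, and size alone.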

The DuplicateEntity table carries the full state:

  • Record1Id, Record2Id — the pair; always stored with smaller ID first so reruns see the same key.
  • PercentageMatch — latest score (updated in place on every run).
  • DuplicateEntityStatusId: 1 = Pending, 2 = Merged, 3 = Dismissed.
  • IsResolved, ResolvedBy, CreatedAt, ResolvedAt.

On each run the orchestrator pre-loads existing pairs. For each fresh candidate:

  • Dismissed pair? Skip — never resurface something a human rejected.
  • Existing unresolved pair? Update PercentageMatch in place (score may have shifted if the underlying row was edited). No new row.
  • New pair? Insert with Status=Pending.
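
The three cases above reduce to a small decision function. The Python sketch below models the pre-loaded pairs as a dictionary keyed by the ordered ID pair; the `DuplicateRow` class and `upsert_candidate` function are hypothetical names, and the treatment of any resolved row as "skip" follows the "never resurface resolved" rule from the algorithm outline.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

PENDING, MERGED, DISMISSED = 1, 2, 3  # DuplicateEntityStatusId values


@dataclass
class DuplicateRow:
    record1_id: int
    record2_id: int
    percentage_match: float
    status_id: int = PENDING
    is_resolved: bool = False


def pair_key(a: int, b: int) -> Tuple[int, int]:
    """Store the smaller ID first so reruns produce the same key."""
    return (a, b) if a < b else (b, a)


def upsert_candidate(existing: Dict[Tuple[int, int], DuplicateRow],
                     a: int, b: int, score: float) -> str:
    """Apply one fresh candidate and report what happened to it."""
    key = pair_key(a, b)
    row = existing.get(key)
    if row is None:
        existing[key] = DuplicateRow(key[0], key[1], score)
        return "inserted"
    if row.status_id == DISMISSED or row.is_resolved:
        return "skipped"  # a human already decided; do not resurface
    row.percentage_match = score  # refresh the score in place, no new row
    return "updated"


pairs: Dict[Tuple[int, int], DuplicateRow] = {}
first = upsert_candidate(pairs, 42, 7, 81.5)   # 'inserted', stored as (7, 42)
second = upsert_candidate(pairs, 7, 42, 84.0)  # 'updated'
pairs[(7, 42)].status_id = DISMISSED
third = upsert_candidate(pairs, 42, 7, 90.0)   # 'skipped'
```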

Saves are batched every 500 rows, and the change tracker is cleared between batches to keep memory flat on large runs.

The frontend /duplicates page (+page.svelte) renders pending pairs grouped by entity type, ordered by PercentageMatch descending. Users can dismiss (sets status 3) or merge — merging goes through the normal EntityMerger pipeline with the normal LiteBus fan-out, so downstream handlers (completeness re-score, query view updates, audit trail) run as usual.

  • A known duplicate isn’t showing up? First question: is the pair in the same block? Run the strategy’s GetBlocksAsync locally on the two specific records and check whether any returned block contains both. If not, a new blocking key is needed.
  • Too many false positives? Raise the threshold, or reduce a scoring component’s weight. A false-positive rate of around 10-20% is acceptable because the UI makes dismissal a single click; if the rate is much lower than that, you’re likely missing real duplicates.
  • A block is getting dropped? The warning log line "{tableName}: Skipping oversized block with {count} items" names the offender. Add a narrower sub-key (same pattern as Address’s line: or Scheme’s addr:...|name:...) that picks the genuine duplicates out of the crowd.
  • Adding a new entity type? Two-file change: implement IDuplicateDetectionStrategy in src/common/services/DuplicateDetection/ and register it in the worker’s strategy list. Follow the existing strategies’ multi-block pattern — start with one wide key and one narrow one.

Shared Patterns

All five jobs follow the same skeleton so operational tooling treats them uniformly:

  • IHostedService worker — single StartAsync that runs to completion, then calls _hostApplicationLifetime.StopApplication(). No HTTP listener.
  • Progress tracking via IJobProgressService — every job writes rows to [app].JobExecution with JobTypeId (4 = CurrencyImport, 6 = DuplicateDetection, etc.), trigger source, container-execution name, and success/failure state. The API exposes /JobExecutions so operators can see history and re-trigger via the admin UI.
  • Env-var trigger metadata: TRIGGER_TYPE (Manual / Scheduled / Event), JOB_EXECUTION_ID (for re-attach when the UI kicked the job off and wants its row pre-created), CONTAINER_APP_JOB_EXECUTION_NAME (Azure-assigned). Defaults cover the manual-invocation case.
  • Graceful cancellation: OperationCanceledException is caught cleanly; the progress row is marked cancelled rather than errored. Jobs stopped mid-flight don’t poison the progress table.
  • Key Vault-backed connection strings — nothing is hardcoded; each job’s managed identity has Key Vault Secrets User on the environment’s vault.
  • 4-hour replicaTimeout: long enough for the largest historical run; a shorter timeout risks cutting off a full rebuild partway through.
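
The env-var and cancellation conventions in the list above can be modelled in a few lines. The Python sketch below is a language-neutral illustration, not the IHostedService worker: `JobCancelled` stands in for OperationCanceledException, the outcome strings are hypothetical, and defaulting TRIGGER_TYPE to "Manual" is an assumption based on the note that defaults cover manual invocation.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class TriggerMetadata:
    trigger_type: str
    job_execution_id: Optional[str]
    container_execution_name: Optional[str]


class JobCancelled(Exception):
    """Stand-in for OperationCanceledException in this sketch."""


def read_trigger_metadata(env: Mapping[str, str] = os.environ) -> TriggerMetadata:
    """Read trigger metadata, falling back to the manual-invocation defaults."""
    return TriggerMetadata(
        trigger_type=env.get("TRIGGER_TYPE", "Manual"),  # assumed default
        job_execution_id=env.get("JOB_EXECUTION_ID") or None,
        container_execution_name=env.get("CONTAINER_APP_JOB_EXECUTION_NAME") or None,
    )


def run_job(work: Callable[[TriggerMetadata], None],
            env: Mapping[str, str] = os.environ) -> str:
    """Run the work once and map the result to a progress-row outcome."""
    meta = read_trigger_metadata(env)
    try:
        work(meta)
        return "Succeeded"
    except JobCancelled:
        return "Cancelled"  # recorded as cancelled, not as an error
    except Exception:
        return "Failed"


def cancelled_work(meta: TriggerMetadata) -> None:
    raise JobCancelled()


print(run_job(lambda meta: None, env={}))                     # Succeeded
print(run_job(cancelled_work, env={"TRIGGER_TYPE": "Event"}))  # Cancelled
```

Separating cancellation from failure is what keeps a deliberately stopped run from showing up as an error in the execution history.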

Adding a new job means following the same shape: new folder under src/services/job/, new Bicep entry using aca_job.bicep, new managed identity + RBAC, new JobTypeId, new IJobProgressService registration. See the existing jobs as templates.