Overview

Ferrum maintains dedicated search index tables so that FHIR search queries never need to scan raw JSONB resources. When a resource is created or updated, the server extracts searchable values using FHIRPath expressions defined in the search_parameters table and writes them into typed index tables.

By default, indexing runs inline: the CRUD operation indexes the resource synchronously before returning the HTTP response, so it is searchable immediately. This is the recommended mode for normal FHIR operations. For large bulk ingests where eventual consistency is acceptable, set fhir.search.inline_indexing: false to defer indexing to background workers. In that mode, the CRUD operation enqueues a background job and returns immediately; a separate worker process picks the job up, extracts values, and writes the index rows inside a single PostgreSQL transaction.

Index tables

Each FHIR search parameter type maps to a dedicated table:
| Table | Parameter type | Key payload columns |
| --- | --- | --- |
| search_string | string | value, value_normalized |
| search_token | token | system, code, display |
| search_token_identifier | token | type_system, type_code, value (:of-type triple) |
| search_date | date | start_date, end_date (half-open UTC range) |
| search_number | number | value (NUMERIC, lossless) |
| search_quantity | quantity | value, system, code, unit |
| search_reference | reference | target_type, target_id, canonical_url, … |
| search_uri | uri | value, value_normalized |
| search_text | special | PostgreSQL tsvector from narrative HTML |
| search_content | special | PostgreSQL tsvector from all string values |
All index tables are UNLOGGED for write throughput — data is recoverable via re-indexing from the authoritative resources table. Every row carries an entry_hash column. Inserts use ON CONFLICT (entry_hash) DO UPDATE so re-indexing the same resource is idempotent.
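The idempotent upsert can be sketched as follows. Which fields feed the real entry_hash, and the exact column list, are assumptions for illustration; only the table and the ON CONFLICT clause come from the description above.

```python
import hashlib

def entry_hash(resource_type: str, resource_id: str,
               parameter: str, value: str) -> str:
    # Hypothetical: a stable hash identifying one index row. The actual
    # inputs Ferrum hashes are not specified in this doc.
    key = f"{resource_type}|{resource_id}|{parameter}|{value}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Re-running the same insert hits ON CONFLICT and updates in place,
# so re-indexing the same resource is idempotent:
UPSERT_SQL = """
INSERT INTO search_string
    (entry_hash, resource_type, resource_id, parameter, value, value_normalized)
VALUES (%s, %s, %s, %s, %s, %s)
ON CONFLICT (entry_hash) DO UPDATE
SET value = EXCLUDED.value,
    value_normalized = EXCLUDED.value_normalized
"""
```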

Indexing pipeline

Resource write (create / update / patch)
    ↓
CRUD service enqueues indexing job (background mode; inline mode runs the same steps within the request)
    ↓
Worker picks up the job and performs:

1. Fetch active search parameters for the resource type
2. Begin transaction + pg_advisory_xact_lock(resource_type, resource_id)
3. Smart-delete stale index rows (only removed parameters)
4. Evaluate FHIRPath expressions → extract typed values
5. Bulk-insert into search_* tables (UNNEST array binding)
6. Update compartment memberships (CareTeam, Group, List)
7. Record index status → commit

Advisory locking

Concurrent workers (e.g. IndexingWorker and SearchParameterWorker) may try to index the same resource at the same time. Ferrum acquires a per-resource pg_advisory_xact_lock keyed on hash(resource_type, resource_id) at the start of the transaction to serialize these writes.
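pg_advisory_xact_lock takes a signed 64-bit integer, so the (resource_type, resource_id) pair must be folded into that key space. A minimal sketch of deriving such a key; the hash function here is an assumption, not Ferrum's actual choice:

```python
import hashlib
import struct

def advisory_lock_key(resource_type: str, resource_id: str) -> int:
    # Fold the pair into a signed 64-bit integer, the type
    # pg_advisory_xact_lock expects. SHA-256 truncation is illustrative.
    digest = hashlib.sha256(f"{resource_type}/{resource_id}".encode()).digest()
    return struct.unpack(">q", digest[:8])[0]  # signed big-endian int64

# Inside the indexing transaction (psycopg-style, illustrative):
#   cur.execute("SELECT pg_advisory_xact_lock(%s)",
#               (advisory_lock_key("Patient", "123"),))
```

Because the lock is transaction-scoped, PostgreSQL releases it automatically at commit or rollback, so no explicit unlock step is needed.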

Smart deletion

Rather than dropping all index rows before re-inserting, Ferrum compares the current set of indexed parameter names against the incoming set. Only parameters that were removed are deleted. Combined with ON CONFLICT DO UPDATE, this minimizes write amplification in the common case where parameters haven’t changed structurally.
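The comparison reduces to a set difference over parameter names; a sketch:

```python
def stale_parameters(currently_indexed: set[str], incoming: set[str]) -> set[str]:
    # Rows for these parameter names are deleted; everything else is
    # refreshed in place by the ON CONFLICT DO UPDATE upsert.
    return currently_indexed - incoming
```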

Value extraction

String normalization

Search strings are stored in both raw (value) and normalized (value_normalized) form. Normalization applies:
  1. NFKD Unicode decomposition
  2. Combining-mark removal (accent stripping)
  3. Lowercase
  4. Strip non-alphanumeric characters
The :exact modifier matches against value; the default match uses value_normalized.
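The four steps map directly onto Python's unicodedata; a sketch, assuming step 4 keeps only ASCII alphanumerics:

```python
import re
import unicodedata

def normalize_search_string(value: str) -> str:
    decomposed = unicodedata.normalize("NFKD", value)      # 1. NFKD decomposition
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))    # 2. drop combining marks
    lowered = stripped.lower()                             # 3. lowercase
    return re.sub(r"[^a-z0-9]", "", lowered)               # 4. strip non-alphanumerics
```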

Date ranges

Every date value is converted to a half-open [start, end) UTC range reflecting its precision:
| Input | Range |
| --- | --- |
| 2024 | [2024-01-01T00:00Z, 2025-01-01T00:00Z) |
| 2024-03 | [2024-03-01T00:00Z, 2024-04-01T00:00Z) |
| 2024-03-15 | [2024-03-15T00:00Z, 2024-03-16T00:00Z) |
| 2024-03-15T10:30 | exact instant (sub-second if provided) |
Period types index both start and end; missing boundaries map to sentinel min/max datetimes.
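The precision-to-range mapping for date-only inputs can be sketched as below; time-of-day and Period handling are omitted for brevity:

```python
from datetime import datetime, timedelta, timezone

def date_range(value: str) -> tuple[datetime, datetime]:
    """Map a date-only FHIR value to a half-open [start, end) UTC range."""
    utc = timezone.utc
    parts = value.split("-")
    if len(parts) == 1:                              # YYYY → whole year
        y = int(parts[0])
        return (datetime(y, 1, 1, tzinfo=utc), datetime(y + 1, 1, 1, tzinfo=utc))
    if len(parts) == 2:                              # YYYY-MM → whole month
        y, m = int(parts[0]), int(parts[1])
        ny, nm = (y + 1, 1) if m == 12 else (y, m + 1)
        return (datetime(y, m, 1, tzinfo=utc), datetime(ny, nm, 1, tzinfo=utc))
    # YYYY-MM-DD → whole day
    start = datetime(int(parts[0]), int(parts[1]), int(parts[2]), tzinfo=utc)
    return (start, start + timedelta(days=1))
```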

Reference indexing

References are parsed into structured fields (target_type, target_id, canonical_url, canonical_version) supporting relative, absolute, canonical, and fragment forms. When a Reference also carries an identifier, Ferrum mirrors it into search_token under the same parameter name to support the :identifier modifier.
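A simplified sketch of the classification; the regular expression and the returned field names are illustrative (they mirror the search_reference columns), and real FHIR reference parsing has more cases than shown here:

```python
import re

def parse_reference(ref: str) -> dict:
    if ref.startswith("#"):                       # fragment → contained resource
        return {"form": "fragment", "target_id": ref[1:]}
    if "|" in ref:                                # canonical with version
        url, version = ref.split("|", 1)
        return {"form": "canonical", "canonical_url": url,
                "canonical_version": version}
    m = re.match(r"^(?:(https?://.+)/)?([A-Z][A-Za-z]+)/([A-Za-z0-9.-]{1,64})$", ref)
    if m:                                         # relative or absolute literal
        base, rtype, rid = m.groups()
        return {"form": "absolute" if base else "relative",
                "target_type": rtype, "target_id": rid}
    return {"form": "canonical", "canonical_url": ref}   # bare canonical URL
```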

Bulk and batch strategies

Ferrum selects an indexing strategy based on the number of resources:
| Count | Strategy | Detail |
| --- | --- | --- |
| < 1 000 | Single batch | One transaction, parameters fetched once per type |
| 1 000 – 9 999 | Chunked batches | Split into 1 000-resource transactions |
| ≥ 10 000 | COPY FROM STDIN | PostgreSQL bulk-load protocol, 10k–50k rows/sec |
Thresholds are configurable via database.indexing_batch_size and database.bulk_threshold in config.yaml.
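The selection reduces to two threshold comparisons; a sketch whose defaults mirror the config.yaml values:

```python
def pick_strategy(count: int, batch_size: int = 1000,
                  bulk_threshold: int = 10000) -> str:
    # Strategy names are illustrative labels for the three modes above.
    if count >= bulk_threshold:
        return "copy_from_stdin"     # PostgreSQL COPY protocol
    if count >= batch_size:
        return "chunked_batches"     # batch_size-resource transactions
    return "single_batch"            # one transaction
```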

Re-indexing

The $reindex operation rebuilds search indexes for one or all resource types. Trigger it from the admin API:
```shell
# Re-index a single resource type
curl -X POST "http://localhost:8080/fhir/Patient/\$reindex" \
  -H "Content-Type: application/fhir+json"

# Re-index everything (enqueues one job per resource type)
curl -X POST "http://localhost:8080/fhir/\$reindex" \
  -H "Content-Type: application/fhir+json"
```
$reindex always runs as background jobs. Each job uses cursor-based pagination to process resources in batches of 500, keeping memory usage constant regardless of dataset size. The response includes the number of jobs enqueued and their IDs.
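The cursor technique is keyset pagination: each batch resumes after the last id seen rather than using OFFSET, so no batch rescans earlier rows and memory stays flat. A generic in-memory sketch of the idea (the real implementation pages over SQL, not a list):

```python
def paginate(ids: list[str], batch: int = 500):
    """Yield batches in id order, resuming after the last id of each batch."""
    cursor = ""                                  # empty string sorts before all ids
    while True:
        page = [i for i in sorted(ids) if i > cursor][:batch]
        if not page:
            return
        yield page
        cursor = page[-1]                        # keyset cursor, not an offset
```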

Caching

Two in-process caches reduce repeated work:
  • Search parameter cache — search parameter definitions per resource type, populated on first index and shared across all workers.
  • FHIRPath plan cache — compiled FHIRPath expression plans keyed by expression string, shared across all requests. Each distinct expression is compiled once per process lifetime.
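The plan cache behaves like memoization keyed on the expression string; a stand-in sketch using functools.lru_cache, where the "compiled plan" is a placeholder tuple rather than a real FHIRPath compiler:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def compile_fhirpath(expression: str):
    # Placeholder for real plan compilation: the cached value is whatever
    # the compiler produces, computed once per distinct expression string.
    return ("compiled", expression)
```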

Configuration

Relevant settings in config.yaml:
```yaml
fhir:
  search:
    inline_indexing: true   # Index synchronously during CRUD (default)
    enable_text: true       # Index _text (narrative HTML)
    enable_content: true    # Index _content (all string values)

database:
  indexing_batch_size: 1000   # Resources per transaction chunk
  bulk_threshold: 10000       # Switch to COPY protocol above this
```
Set inline_indexing: false when you expect large data ingests and eventual consistency is acceptable. Admin operations like $reindex and package installs always use background workers regardless of this setting.

Known gaps

  • Composite search parameter indexing is not yet supported.
  • Date period overlap search logic is incomplete for edge cases.