How OXFORDIA works

The problem

Rare disease research is constrained by a structural problem: no single institution has enough patients to power a statistically meaningful study. Nemaline myopathy, the condition that motivated this work, has an incidence of roughly 1 in 50,000. Fourteen universities across multiple jurisdictions hold trial data on this disease. Independently, none has enough records to meet regulatory evidentiary thresholds. Combined, they do.

But combining the data has historically been impractical. Institutions operate under different legal regimes (HIPAA, GDPR, local research ethics frameworks), use different schemas and vocabularies, and reasonably refuse to surrender raw patient records to a central authority. The conventional answer (bilateral data-sharing agreements, secure file transfers, and bespoke ETL) does not scale beyond two or three sites and routinely takes years to negotiate per study.

The solution: federated computation

Each participating institution hosts its own OXFORDIA Node and keeps its data on-premises. A researcher declares a list of partner institutions to query and runs a statistical analysis from R. Raw patient records never leave the institution that owns them. Only aggregate results flow back to the researcher, where they are combined client-side into a single answer.

library(oxfordia)
library(solidauthr)

# Sign in through the researcher's institutional identity provider.
auth <- solid_login(idp = "https://oxfordia.med.ox.ac.uk")

# Declare the partner nodes to query, one named entry per site.
targets <- oxfordia_targets(
  oxford    = "https://oxfordia.med.ox.ac.uk/cohort/nemaline",
  hacettepe = "https://oxfordia.hacettepe.edu.tr/cohort/nemaline",
  partner3  = "https://oxfordia.partner3.edu/cohort/nemaline"
)

# Request the mean of BaselineAge across all declared sites.
result <- oxfordia_mean(
  targets    = targets,
  auth       = auth,
  graph_path = "BaselineAge"
)

result$value    # 21.6309888889
result$n        # 180
result$per_site # tibble: site, mean, count

The researcher's experience is unremarkable — it looks like normal R. The federation, the authentication, the per-site access checks, and the SPARQL queries are all invisible.
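
For the mean, the client-side combination step can be pictured as a count-weighted average of the per-site means. The sketch below is plain base R written for illustration only: it is not the oxfordia package's internal code, the helper name combine_site_means is hypothetical, and the per-site figures are made up.

```r
# Hypothetical helper: pool per-site means into one global mean.
# Assumes each site reports its local mean and the record count behind it.
combine_site_means <- function(per_site) {
  stopifnot(all(c("site", "mean", "count") %in% names(per_site)))
  n_total <- sum(per_site$count)
  # Weight each site's mean by its record count so larger sites count for more.
  pooled <- sum(per_site$mean * per_site$count) / n_total
  list(value = pooled, n = n_total, per_site = per_site)
}

# Made-up per-site aggregates (not real study data).
per_site <- data.frame(
  site  = c("site_a", "site_b", "site_c"),
  mean  = c(20, 22, 25),
  count = c(50, 30, 20)
)

result <- combine_site_means(per_site)
result$value  # 21.6
result$n      # 100
```

Only the three (mean, count) pairs are needed to reproduce the pooled value, which is why no individual record has to leave a site for this statistic.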

The novel contribution: Statistic Access Rules

The core innovation is the Statistic Access Rule (SAR): a permission model that lets an institution say "yes, you may compute the mean of this column" while still saying "no, you may not see any individual value."

Conventional access control can only permit or deny access to a file as a whole. A site administrator using standard tools faces a binary choice per researcher per dataset:

  • Grant read access — the researcher can pull every record and compute anything they like, including things the site would prefer they did not.
  • Deny read access — the researcher gets nothing, including the aggregates the site would have been willing to share.

There is no middle position. SAR closes that gap. An institution can publish a dataset and, in the same breath, declare that named external collaborators may compute means and Kaplan–Meier curves against specific fields with a minimum cohort size of 10, while no one — including those collaborators — may pull a single row of data.
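
One way to picture such a rule is as a small record listing who may compute which statistics on which fields, plus a minimum cohort size. The sketch below is a toy model in base R under that assumption; the list layout, the field names (agents, statistics, fields, min_count), the function sar_permits, and the example identities are all hypothetical and do not reflect the actual OXFORDIA rule vocabulary.

```r
# Hypothetical in-memory representation of one Statistic Access Rule.
rule <- list(
  agents     = c("https://id.example.org/alice"),  # permitted collaborators
  statistics = c("mean", "kaplan_meier"),          # permitted aggregate operations
  fields     = c("BaselineAge"),                   # fields the rule applies to
  min_count  = 10                                  # minimum cohort size
)

# Returns TRUE only when the agent, statistic, and field are all covered by the rule.
# Because "read" is not listed in rule$statistics, a row-level request is refused.
sar_permits <- function(rule, agent, statistic, field) {
  agent %in% rule$agents &&
    statistic %in% rule$statistics &&
    field %in% rule$fields
}

sar_permits(rule, "https://id.example.org/alice", "mean", "BaselineAge")  # TRUE
sar_permits(rule, "https://id.example.org/alice", "read", "BaselineAge")  # FALSE
sar_permits(rule, "https://id.example.org/bob",   "mean", "BaselineAge")  # FALSE
```

The point of the model is that the decision is made per statistic and per field rather than per file, which is the middle position that a plain read/deny switch cannot express.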

Architecture

OXFORDIA is a network of peer nodes. There is no central server, no central data store, and no central authority.

| Actor | Role |
| --- | --- |
| Researcher | Runs queries from R using their institutional identity |
| Sysadmin | Deploys and operates an OXFORDIA Node |
| Data administrator | Loads datasets and authors Statistic Access Rules |

Architecture diagram showing a researcher's R environment obtaining a token from an identity provider and dispatching queries to multiple partner Solid Servers

Query flow

When a researcher runs a query:

  1. The R client obtains an identity token from the researcher's institutional identity provider.
  2. The token is presented to each target node.
  3. Each node independently verifies the identity and checks the query against its local Statistic Access Rules.
  4. If authorized, the node executes the query against its local triplestore.
  5. Post-query constraints are checked. For example, under minCount the result is withheld if fewer records matched than the threshold.
  6. Each site returns its local aggregate result and a count.
  7. The R client combines per-site results into a global statistic.
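
Steps 4 to 7 can be simulated locally to see how a minimum-count constraint interacts with the final combination. This is a minimal sketch in base R, not the node implementation: the function node_mean, the "withheld" status value, the choice to withhold the count along with the mean, and the sample values are all assumptions made for illustration.

```r
# Hypothetical node-side step: aggregate locally, then enforce the minimum count.
node_mean <- function(values, min_count) {
  n <- length(values)
  if (n < min_count) {
    # Too few matching records: withhold the result (assumed response shape).
    return(list(status = "withheld", mean = NA_real_, count = NA_integer_))
  }
  list(status = "ok", mean = mean(values), count = n)
}

# Made-up local data for two sites; only the returned list leaves node_mean().
site_a <- node_mean(c(18, 22, 25, 30, 19, 21, 24, 27, 20, 23, 26, 29), min_count = 10)
site_b <- node_mean(c(31, 28, 35), min_count = 10)

site_a$status  # "ok"
site_b$status  # "withheld"

# Client side: keep only the sites that returned a result, then pool by count.
responses <- list(site_a, site_b)
ok <- Filter(function(r) r$status == "ok", responses)
counts <- vapply(ok, function(r) as.numeric(r$count), numeric(1))
means  <- vapply(ok, function(r) r$mean, numeric(1))
global_mean <- sum(means * counts) / sum(counts)
```

In this toy run the second site contributes nothing to global_mean, because its three records fall below the threshold of 10.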

Query flow diagram: aggregate query enters the Query Access Evaluator, is either rejected or converted to SPARQL, results are filtered, and a result is returned
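
To make the "converted to SPARQL" step concrete, the sketch below builds the kind of aggregate query a node might run for a mean request, using the standard SPARQL 1.1 AVG and COUNT aggregates. It is an illustration under stated assumptions: the function build_mean_query and the predicate IRI are hypothetical, and how a graph_path such as "BaselineAge" maps to a predicate in a real node is not specified here.

```r
# Hypothetical translation of a "mean" request into a SPARQL aggregate query.
# The predicate IRI is a placeholder; a real node would take it from its own
# schema and rules, never from unchecked user input.
build_mean_query <- function(predicate_iri) {
  sprintf(
    paste(
      "SELECT (AVG(?v) AS ?mean) (COUNT(?v) AS ?count)",
      "WHERE { ?patient <%s> ?v . }",
      sep = "\n"
    ),
    predicate_iri
  )
}

query <- build_mean_query("https://example.org/schema#BaselineAge")
cat(query, "\n")
```

Because the query selects only the aggregate and the count, the result set contains a single row of summary values rather than individual patient records.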

Open standards

OXFORDIA is built on Solid, RDF, and SPARQL — all open W3C-aligned standards. The full implementation is published under the MIT License at github.com/OXFORDIA-project/OXFORDIA-node.

Next steps