LIMS — Specimen Analysis Workflow Management
A LIMS backend incrementally re-aligned in production around three decision criteria — analysis-pipeline decoupling, concurrency safety across the specimen lifecycle, and explicit time-boundary policies for long-running work
Overview
The backend of a LIMS (Laboratory Information Management System) that digitizes the laboratory workflow for handling specimens sent in by clients — covering the full lifecycle of sample registration → barcoding → preparation → test execution → QC → result analysis → report generation → result delivery. Like any standard LIMS, every state transition has to be traceable as part of the chain of custody.
A LIMS has operational characteristics that go beyond a typical web service: analysis pipelines run on remote hosts for tens of minutes to several hours, a single specimen’s lifecycle is a multi-stage state machine acted on by several operators in parallel, and an outage of the remote analysis host or NFS leaves work indefinitely backed up in flight. While operating this backend, I’ve been aligning it around three decision criteria: analysis-pipeline decoupling, concurrency safety across the specimen lifecycle, and explicit time-boundary policies for long-running work.
Tech Stack
- Backend: Python, Django 4.2, DRF 3.16, PostgreSQL 15 (psycopg3)
- Auth/Security: JWT (simplejwt), IP allowlist middleware
- Infra: Celery 5.3, Redis, NGINX, Docker Compose (local / production / deploy environments)
My Role
Backend Developer. I own the analysis-pipeline dispatch architecture, specimen-lifecycle state-machine consistency, queue / worker isolation, and system-log / identifier conventions — in short, the operational side of the backend. The consistent working pattern: land LIMS-domain requirements incrementally, in production, without disturbing stability.
Key Contributions
- Spawned remote analysis pipelines and detected completion on a separate channel.
- Analyses run on a remote host for tens of minutes to several hours, so holding an SSH session that long is fragile — a single network blip kills the analysis. I switched to SSH key-based auth and run remote commands inside a tmux session, decoupling the analysis lifetime from the backend’s SSH connection lifetime; the backend only waits for the command to launch, then returns immediately. Completion is detected on a separate channel: a polling task watches a shared NFS mount for result files written by the remote host, and analyses with no result file after 12 hours are auto-transitioned to a failure state, bounding indefinite waits at the system level.
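The polling task's core decision can be sketched as a pure function. This is a minimal sketch, assuming a hypothetical name (poll_decision) and the 12-hour bound described above; the real task would map "complete" / "failed" onto the specimen's state transitions.

```python
from datetime import datetime, timedelta

# System-level bound on how long an analysis may stay in flight (from the
# policy above; the constant name is an assumption).
ANALYSIS_DEADLINE = timedelta(hours=12)

def poll_decision(launched_at: datetime, now: datetime,
                  result_file_exists: bool) -> str:
    """Decide what the NFS polling task should do for one in-flight analysis."""
    if result_file_exists:
        return "complete"   # result file landed on the shared NFS mount
    if now - launched_at >= ANALYSIS_DEADLINE:
        return "failed"     # bound the wait: auto-transition to failure
    return "running"        # keep polling
```

Keeping the decision separate from the filesystem check makes the timeout policy trivially unit-testable, independent of the NFS mount.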
- Made the specimen lifecycle’s 15 state transitions concurrency-safe with row-level locking.
- A specimen moves through 15 explicit states — receipt → QC → experiment → analysis → review → delivery — plus bypass branches for re-sampling and re-experimenting that have to live in the same system. Instead of pulling in an external state-machine library, transition functions are gathered in a service-layer dictionary, and each transition runs inside a transaction that takes a row-level lock on that specimen first. With operators acting concurrently on the same specimen, two transitions are serialized at the data layer, and re-sample / re-experiment branches are handled as compensating actions that cleanly rewind in-flight analyses.
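The service-layer dictionary pattern can be sketched as follows. The state and action names here are illustrative, not the system's actual 15 states; in the real service, each lookup runs only after taking the row-level lock (Django's select_for_update() inside transaction.atomic()), as noted in the comment.

```python
# Hypothetical transition table: (current_state, action) -> next_state.
# Bypass branches (re-sample / re-experiment) live in the same table.
TRANSITIONS = {
    ("received", "start_qc"): "qc",
    ("qc", "start_experiment"): "experiment",
    ("experiment", "start_analysis"): "analysis",
    ("analysis", "start_review"): "review",
    ("review", "deliver"): "delivered",
    ("qc", "resample"): "received",          # bypass: back for re-sampling
    ("analysis", "re_experiment"): "experiment",  # bypass: re-run experiment
}

def next_state(current: str, action: str) -> str:
    """Pure lookup; in the service layer this runs inside transaction.atomic()
    after Specimen.objects.select_for_update().get(pk=...) holds the row lock."""
    try:
        return TRANSITIONS[(current, action)]
    except KeyError:
        raise ValueError(f"illegal transition: {current} -> {action}")
```

Because every legal edge is an explicit dictionary entry, an unexpected operator action fails loudly instead of silently corrupting the chain of custody.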
- Isolated long-running report jobs from interactive work at the worker level.
- PDF / Excel report generation takes minutes — template rendering plus PDF merging — so a single queue would block immediate-response work like specimen intake or QC approval. Reports route to a dedicated queue with a dedicated single-concurrency worker that consumes only that queue, separating “results can be slow” work from “must respond immediately” work at the worker-pool level. The default worker also has a 1-hour time limit and memory cap so no single task monopolizes the pool.
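A sketch of the routing and limits, in Celery settings style — the task paths and exact cap values here are assumptions, not the project's actual names:

```python
# Route report tasks to their own queue; everything else stays on "default".
CELERY_TASK_ROUTES = {
    "reports.tasks.build_pdf_report": {"queue": "reports"},
    "reports.tasks.build_excel_report": {"queue": "reports"},
}
CELERY_TASK_DEFAULT_QUEUE = "default"
CELERY_TASK_TIME_LIMIT = 60 * 60  # 1-hour hard limit on the default pool

# The report worker consumes only its queue, one task at a time, with a
# per-child memory cap (value in KiB):
#   celery -A lims worker -Q reports --concurrency=1 --max-memory-per-child=524288
```

Routing by task name rather than call site means a report task can never land on the interactive pool by accident, even when enqueued from new code paths.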
Troubleshooting
- SSH session disconnects mid-pipeline killing long-running analyses.
- Problem: The original setup used password-based SSH and a synchronous call that held the SSH session for the entire duration of the pipeline. Analyses run for tens of minutes to hours, so any short blip in the lab network — or even a single idle timeout — would tear the session down, kill the in-flight pipeline, and leave the backend task either hanging or force-reaped on timeout. Two structural weaknesses stacked on top of each other: a password being passed through external commands, and the analysis lifetime being tied to the SSH connection lifetime.
- Solution: Two changes, sequenced. (1) Migrated to SSH key-based authentication so passwords no longer flow through external commands. (2) Spawned the remote command inside a tmux session whose lifetime is independent of the backend’s SSH connection. The backend only waits for the launch and returns immediately; completion is detected by a separate polling task that watches the shared NFS mount for result files. With the pipeline running inside tmux, it survives any number of SSH disconnects or backend restarts — analysis success is no longer gated by network stability.
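The launch side of change (2) can be sketched as an argv builder — the host, key path, and script names are hypothetical. The point is that ssh only has to outlive tmux accepting the command, not the pipeline itself.

```python
import shlex

def build_launch_cmd(host: str, key_path: str,
                     session: str, remote_cmd: str) -> list[str]:
    """Build the ssh argv that starts remote_cmd inside a detached tmux session.

    `tmux new-session -d` returns as soon as the session is created, so the
    backend's ssh call finishes in seconds while the pipeline keeps running
    on the remote host, immune to later disconnects.
    """
    tmux_cmd = (f"tmux new-session -d -s {shlex.quote(session)} "
                f"{shlex.quote(remote_cmd)}")
    return ["ssh", "-i", key_path, host, tmux_cmd]

# e.g. build_launch_cmd("analyst@lab-host", "~/.ssh/lims_key",
#                       "run_42", "run_pipeline.sh /data/42")
```

Quoting through shlex keeps specimen-derived session names and paths from being interpreted by the remote shell.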
- Parallel report-finalization tasks racing on the same specimen with stale snapshots.
- Problem: When a specimen’s report finishes, its state transitions to “delivery complete.” With report tasks finalizing in parallel for the same specimen, the snapshot held by the first task raced against the second task’s update, so finalization could end up writing based on a stale prior state (like “analysis complete”). Django’s plain refresh_from_db() reads a lock-free snapshot, which isn’t safe in a concurrent finalization scenario.
- Solution: Wrapped the transition site to take a row-level lock on the specimen and update inside a transaction, serializing two finalize tasks on the same specimen automatically. The same pattern was applied to the bulk-create path during test-batch creation, preventing duplicate Analysis rows when two operators created batches concurrently.
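The shape of the fix can be illustrated with a toy in-process model, where a threading.Lock stands in for the database row lock (select_for_update() inside transaction.atomic() in the real code) — the class and field names are illustrative:

```python
import threading

class SpecimenRow:
    """Toy stand-in for a specimen row; the Lock plays the row lock's role."""
    def __init__(self, state: str):
        self.state = state
        self.lock = threading.Lock()

def finalize_delivery(row: SpecimenRow) -> None:
    with row.lock:                    # serialize: the second finalizer waits here
        if row.state != "delivered":  # fresh read under the lock, never a stale snapshot
            row.state = "delivered"

# Two finalizers racing on the same specimen end up serialized:
row = SpecimenRow("analysis_complete")
workers = [threading.Thread(target=finalize_delivery, args=(row,)) for _ in range(2)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

The essential move is the same in both worlds: acquire the lock first, then re-read state inside the critical section, so the decision is always made against current data.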
Impact
A backend re-aligned for a LIMS domain where long-running analysis pipelines, multi-stage specimen lifecycles, and chain-of-custody requirements all apply at once. The most satisfying part is that decisions like “analysis lifetime must not be tied to an SSH session,” “specimen state transitions are serialized via row-level lock,” and “long-running work has a system-level timeout guard” all live consistently inside the same system, applied through the same toolbox.