BrainStream
A solo personal music acquisition + normalization pipeline designed to satisfy recommendation diversity, metadata consistency, and one-person operability simultaneously
Overview
A solo personal pipeline that turns my own ListenBrainz listening history into a continuously growing self-hosted music library served by Navidrome. Python (FastAPI), a single SQLite database, and Docker Compose run BrainStream and Navidrome side by side, closing the loop “recommendation → normalization → acquisition → library import → external client response” inside one container stack.
To run this loop without a commercial SaaS, three things had to be satisfied at once: recommendations can’t get trapped in an echo chamber, metadata has to stay consistent across the whole library, and a multi-stage pipeline that crashes mid-stage has to resume safely while still being small enough for one person to operate. Every design decision rolls up to those three axes.
Tech Stack
- Runtime: Python 3, FastAPI (Web UI · REST · SSE), a single SQLite database for pipeline state, mutagen for tagging
- External data: ListenBrainz CF + LB Radio, MusicBrainz, Cover Art Archive, iTunes Search, Deezer
- Packaging: Docker Compose (separate prod/local files), GHCR image releases, Navidrome bundled in the same Compose
My Role
Solo personal side project. I designed the pipeline, integrated all external APIs, wrote the metadata and tagging logic, the FastAPI web UI, the Subsonic proxy, the Docker Compose packaging, and the GHCR release workflow — and I’m the only operator.
Key Contributions
- Diversified recommendation pool + an explicit pre-check to detect external-API model retrains.
  - I poll ListenBrainz CF (80%) and top-artist-seeded radio (20%) so the pool is forced to include adjacent artists, blocking the echo-chamber effect of CF alone. More importantly, the CF model retrains periodically and the head of the result list shifts with it, so naively continuing from the last page position walks the tail of the old model and silently misses the refreshed recommendations. On every run the pipeline fetches the first result’s identifier and compares it to the stored value; if it changed, the model retrained and pagination resets, making the external-API model change explicit instead of invisible.
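The pre-check above can be sketched as follows. This is a minimal illustration, not the actual code: the `cf_state` table, its columns, and `detect_retrain_and_offset` are hypothetical names standing in for whatever the real state.db schema uses.

```python
import sqlite3

# Hypothetical single-row state table: the first MBID seen last run
# plus the pagination offset to continue from.
SCHEMA = ("CREATE TABLE IF NOT EXISTS cf_state ("
          "id INTEGER PRIMARY KEY, first_mbid TEXT, page_offset INTEGER)")

def detect_retrain_and_offset(conn: sqlite3.Connection,
                              current_first_mbid: str) -> int:
    """Return the pagination offset to use for this run.

    If the first MBID returned by the CF endpoint differs from the one
    stored last run, the model was retrained: record the new head and
    reset the offset to 0. Otherwise continue from the saved offset.
    """
    row = conn.execute(
        "SELECT first_mbid, page_offset FROM cf_state WHERE id = 1"
    ).fetchone()
    if row is None or row[0] != current_first_mbid:
        with conn:  # commit the reset atomically
            conn.execute(
                "INSERT INTO cf_state (id, first_mbid, page_offset) "
                "VALUES (1, ?, 0) "
                "ON CONFLICT(id) DO UPDATE SET "
                "first_mbid = excluded.first_mbid, page_offset = 0",
                (current_first_mbid,),
            )
        return 0
    return row[1]
```

The point of the design is that a retrain is detected with one extra probe per run, instead of being inferred much later from a recommendation pool that stopped changing.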
- MusicBrainz as the normalization source + an asymmetric fallback chain calibrated to each source’s trust profile.
  - Raw external metadata (titles like “Artist - Track (Official Audio) [Remastered 2023]”) breaks library consistency, so every display string and path anchors to MusicBrainz-normalized values. But MusicBrainz often returns compilation or live releases first, with covers that don’t match the canonical studio album. So album name falls back iTunes → Deezer → MusicBrainz → “Unknown Album”, and cover art skips the Cover Art Archive when iTunes or Deezer has already matched the album, prioritizing the higher-confidence commercial source’s consistency. Tagging happens in two stages on the staging file before being moved into the library in a single step, so a crash mid-tagging never leaves a half-tagged file behind.
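A compact sketch of that fallback shape. All function and parameter names here are illustrative; the atomic publish relies on `os.replace`, which is only atomic when staging and library live on the same filesystem (an assumption, not something the original states).

```python
import os
from typing import Optional

def resolve_album_name(itunes: Optional[str], deezer: Optional[str],
                       musicbrainz: Optional[str]) -> str:
    """Album name falls back iTunes -> Deezer -> MusicBrainz -> 'Unknown Album'."""
    for candidate in (itunes, deezer, musicbrainz):
        if candidate:
            return candidate
    return "Unknown Album"

def pick_cover_source(itunes_matched: bool, deezer_matched: bool) -> str:
    """Skip the Cover Art Archive whenever a commercial source already
    matched the album, since its artwork is the higher-confidence one."""
    if itunes_matched:
        return "itunes"
    if deezer_matched:
        return "deezer"
    return "cover_art_archive"

def publish(staging_path: str, library_path: str) -> None:
    """Tagging runs on the staging file; moving into the library is one
    os.replace call, so a crash mid-tagging never leaves a half-tagged
    file inside the library tree."""
    os.replace(staging_path, library_path)
```

The asymmetry is deliberate: MusicBrainz wins for identity (names, paths), while the commercial stores win for presentation (album title, cover), matching each source's trust profile.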
- state.db as a pending-work table in a saga + per-stage checkpoint recovery, sized for one operator.
  - Each track moves through a multi-stage saga (download → validate → tag → library copy → Navidrome scan), with each stage checkpointing progress and the file path into a single SQLite file. On worker death, unfinished rows are re-queued at boot, and per-stage duplicate-work guards (e.g., if the staging file already exists, skip download and resume from tagging) keep work from being repeated. The runtime plane stays small on purpose: one SQLite file, schema evolution at boot without a migration tool, and Navidrome staying inside the Docker network with external Subsonic clients only reaching it through BrainStream’s proxy.
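The boot-time recovery path can be sketched like this. The `jobs` table and both function names are hypothetical; the real state.db schema and stage names may differ.

```python
import os
import sqlite3

STAGES = ("download", "validate", "tag", "library_copy", "navidrome_scan")

def requeue_unfinished(conn: sqlite3.Connection) -> int:
    """At boot, put rows that died mid-stage back into the pending queue
    so a worker picks them up again. Returns the number of re-queued jobs."""
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET status = 'pending' WHERE status = 'running'"
        )
    return cur.rowcount

def resume_stage(conn: sqlite3.Connection, job_id: int) -> str:
    """Per-stage duplicate-work guard: if the checkpointed stage is
    'download' but the staging file already exists on disk, the download
    completed before the crash, so resume from the next stage instead."""
    stage, staging_path = conn.execute(
        "SELECT stage, staging_path FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    if stage == "download" and staging_path and os.path.exists(staging_path):
        return "validate"
    return stage
```

Together these give at-least-once stage execution: a crash can re-run a stage, but the guards make each stage cheap or free to repeat.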
Troubleshooting
- Silent failure: a stale offset hiding new recommendations after a ListenBrainz model retrain.
  - Problem: Originally the pipeline simply continued paginating from the last-used offset between runs. ListenBrainz periodically retrains its CF model, which changes the head of the result list; paginating past the new top means the pipeline keeps walking the tail of the old model and never sees the refreshed recommendations. There is no error and no signal that the model changed, so this is a silent failure that can go undetected for a long time.
  - Solution: Probe one MBID from CF on every run and compare it to the stored first MBID. When it changes, the model was retrained, and resetting the offset to 0 makes the change in the external API explicit instead of invisible.
- Concurrent manual downloads racing on a check-then-insert window.
  - Problem: When a user clicked the manual download button twice in quick succession on the same track, both handlers saw “no active job” before either inserted — a classic check-then-insert TOCTOU race that produced duplicate pending rows for the same track and downloaded the same audio twice.
  - Solution: Folded the duplicate check and the insert into a single SQLite transaction, so the loser of the race automatically receives the existing pending row back and returns a conflict response instead of inserting again. The same pattern was applied to the boot-time recovery path, so a scheduled poll firing immediately after a restart can’t re-enqueue the same track either — deduplication is now an explicit invariant on top of an at-least-once delivery model.
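A minimal sketch of the collapsed check-and-insert, assuming a hypothetical `jobs` table with a UNIQUE constraint on the track identifier; `enqueue_download` and the schema are illustrative names, not the real code.

```python
import sqlite3

def enqueue_download(conn: sqlite3.Connection, track_mbid: str):
    """Atomically enqueue a pending download job.

    The check and the insert are one SQLite transaction backed by a
    UNIQUE constraint, so concurrent callers cannot both pass a
    'no active job' check: the loser hits IntegrityError and receives
    the winner's existing row back, which the HTTP layer can map to a
    409 Conflict instead of inserting a duplicate.
    Returns (job_id, created).
    """
    try:
        with conn:  # one transaction: commit on success, rollback on error
            cur = conn.execute(
                "INSERT INTO jobs (track_mbid, status) VALUES (?, 'pending')",
                (track_mbid,),
            )
            return cur.lastrowid, True
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT id FROM jobs WHERE track_mbid = ?", (track_mbid,)
        ).fetchone()
        return row[0], False
```

Letting the database enforce uniqueness, rather than the handler, is what closes the TOCTOU window: there is no moment where two callers can both observe the absence of a row and act on it.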
Impact
A personal automation that closes the loop “my listening history → my library” entirely in code, without depending on a commercial SaaS. The most satisfying part is that recommendation diversity, metadata accuracy, and saga recoverability don’t trade off against each other: the design pushes domain normalization onto MusicBrainz, shapes fallbacks asymmetrically based on each external API’s trust profile, and keeps the runtime small enough that one person can keep operating it.