Newsletter Service
An in-house newsletter delivery service designed around three constraints at once: concurrency in the send pipeline, online schema changes on a hot, growing table, and forward compatibility for identifiers already shipped in external in-flight mail
Overview
An in-house newsletter delivery service built to run marketing and customer-communication channels without depending on an external SaaS. Django + Celery + PostgreSQL + Redis + NGINX run under Docker Compose on Naver Cloud Platform, and the system’s performance and correctness are essentially decided by one high-cardinality table: EmailLog, which gains tens of thousands of rows per send.
Because schema changes have to land while the service is sending, the core challenge was satisfying three things at once: concurrency and correctness in the send pipeline, online schema changes on a hot growing table, and forward-compatibility for identifiers already embedded in external mail in flight. Every design decision rolls up to those three axes.
Tech Stack
- Backend: Python, Django, Celery, Redis, PostgreSQL, NGINX
- Infra: Docker Compose, Naver Cloud Platform (NCP)
My Role
Sole backend engineer on a team of PM 1 / FE 2 / BE 1. I own the Django WAS design, data modeling, send pipeline, tracking and statistics, and migration design — the full server side — and have been carrying both new feature work and operational hardening since September 2024.
Key Contributions
- EmailLog as a pending-rows table in a transaction outbox structure.
  - When a send is triggered, an EmailLog row is created up-front for each recipient (the outbox write); a worker then claims 100 rows at a time with row-level locks, sends them via SMTP, and marks each row as success or failed (the relay). The short-transaction chain lets multiple workers split a single send without coordination, automatically releases locks if a worker dies, and a (Email, subscriber) unique constraint blocks double-sends at the data layer. A separate reclamation job flips rows stuck in “sending” for over 30 minutes back to “pending,” so worker-crash recovery is absorbed by the same model without manual intervention. (See the first sketch after this list.)
- Reworked the migration pattern itself so indexes can be added under live traffic.
  - With tens of thousands of rows piling up per send in EmailLog, a plain CREATE INDEX would lock the table for minutes and stall the pipeline. PostgreSQL’s CONCURRENTLY option lets composite indexes build alongside live traffic, and the cleanup migrations were rewritten from ORM loops into a single query that picks the rows to keep plus one bulk delete — cutting five migrations from 422s to 195s combined (-54%). Large-table changes default to that pattern from the outset now. (See the second sketch after this list.)
- Migrating the tracking schema without breaking links already embedded in mail that had been sent.
  - I wanted to normalize the tracking-link model from per-recipient tracking to per-Email tracking, but old per-recipient tracking codes were already embedded in tens of thousands of sent emails — a naive normalization would break every one of those clicks days later. A separate mapping table connects old codes to the new tracking rows, and click handling adds one step (old code → new row resolution) before redirecting. The internal schema is normalized cleanly while data already shipped to users stays compatible — solving “we can’t break identifiers already living outside our system” with an explicit mapping layer. (See the third sketch after this list.)
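
A minimal sketch of the claim loop from the first item above, assuming hypothetical names (the EmailLog import path, a status field, an updated_at timestamp) rather than the actual schema:

```python
from datetime import timedelta

from django.db import transaction
from django.utils import timezone

from newsletter.models import EmailLog  # hypothetical import path

BATCH = 100

def claim_batch():
    # Short transaction: lock up to 100 pending rows, mark them
    # "sending", commit. skip_locked means concurrent workers see only
    # rows nobody else holds, so they split one send with no
    # coordination; if a worker dies here, its locks vanish with the
    # aborted transaction.
    with transaction.atomic():
        rows = list(
            EmailLog.objects.select_for_update(skip_locked=True)
            .filter(status="pending")
            .order_by("id")[:BATCH]
        )
        EmailLog.objects.filter(pk__in=[r.pk for r in rows]).update(status="sending")
    return rows

def relay(rows, smtp_send):
    # Outside any long transaction: deliver each claimed row and record
    # a terminal state per row.
    for row in rows:
        try:
            smtp_send(row)
            row.status = "success"
        except Exception:
            row.status = "failed"
        row.save(update_fields=["status"])

def reclaim_stuck():
    # Reclamation job: rows stuck in "sending" for over 30 minutes go
    # back to "pending" and re-enter the normal claim loop.
    cutoff = timezone.now() - timedelta(minutes=30)
    EmailLog.objects.filter(status="sending", updated_at__lt=cutoff).update(status="pending")
```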
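
The index pattern from the second item, sketched as a Django migration; app, model, and index names are placeholders, and AddIndexConcurrently assumes Django 3.0+:

```python
from django.contrib.postgres.operations import AddIndexConcurrently
from django.db import migrations, models

class Migration(migrations.Migration):
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
    # so the migration opts out of Django's per-migration transaction.
    atomic = False

    dependencies = [("newsletter", "0001_initial")]  # placeholder

    operations = [
        AddIndexConcurrently(
            model_name="emaillog",
            index=models.Index(
                fields=["email", "status"],
                name="emaillog_email_status_idx",
            ),
        ),
    ]
```

The cleanup rewrite follows the same spirit: one query computes the ids to keep, then a single bulk delete replaces the per-row ORM loop.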
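
And the mapping-layer resolution from the third item, as a redirect-view sketch; LegacyTrackingCode and the field names are hypothetical stand-ins for the real mapping table:

```python
from django.http import HttpResponseRedirect
from django.shortcuts import get_object_or_404

from newsletter.models import EmailLink, LegacyTrackingCode  # hypothetical

def click_redirect(request, code):
    # One extra step for old codes: per-recipient codes embedded in
    # already-sent mail resolve through the mapping table to the new
    # per-Email row; codes issued after the migration hit EmailLink
    # directly. Either way the redirect target is the same normalized row.
    mapping = (
        LegacyTrackingCode.objects.filter(old_code=code)
        .select_related("link")
        .first()
    )
    link = mapping.link if mapping else get_object_or_404(EmailLink, code=code)
    return HttpResponseRedirect(link.url)  # click-history recording elided
```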
Troubleshooting
- EmailLink data exploding because the initial schema kept it at per-subscriber granularity.
  - Problem: The initial EmailLink schema used (Email, subscriber, url) as the row key — for one send with N subscribers and M tracked URLs, that’s N × M EmailLink rows per send. As send volume piled up, EmailLink became the fastest-growing table in the system. On top of that, the same (Email, subscriber, url) combination had no unique constraint, so retries and re-tracking added duplicates within an already-bloated table. EmailLink is the source of truth for click tracking, statistics, and redirect lookup, so every additional row pushed query latency up.
  - Solution: Cleaned up in two steps. (1) Added a unique constraint on (Email, subscriber, url) and deduped existing rows by keeping the one with the most click signal — stopping same-kind duplicates from accumulating again. (2) A bigger schema redesign: split click history into a new EmailLinkHistory table, and collapsed EmailLink itself to (Email, url) — removing the per-subscriber dimension from EmailLink entirely. EmailLink row count went from O(Emails × subscribers × urls) to O(Emails × urls); the multiplicative scaling factor by subscriber count is gone. Old per-subscriber tracking codes embedded in already-sent mail were preserved through a separate mapping table. (See the first sketch after this list.)
- Silent send failure when a worker died mid-write to the pending table.
  - Problem: The pending-row creation task was acknowledged by the broker the moment the worker received it. If the worker then crashed before or during the bulk insert, the broker treated the task as already completed and never redelivered it — leaving EmailLog empty while the send appeared “completed.” Operators only have EmailLog to see what was sent, so that source of truth was being silently zeroed out under crash conditions.
  - Solution: Changed the broker-ack timing so it only acknowledges after the task body returns successfully — if the worker dies mid-write, the broker redelivers automatically. Re-execution is safe because the unique constraint plus the conflict-ignore option ensures a second attempt creates no duplicate rows. The fix codifies the outbox structure’s core assumption — durability of the pending-write — at the broker layer. (See the second sketch after this list.)
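
A sketch of the reshaped schema from the first item above; the model and field definitions are illustrative, not the production code:

```python
from django.db import models

class EmailLink(models.Model):
    # Collapsed to (Email, url): the per-subscriber dimension is gone,
    # so row count scales as O(Emails × urls) rather than
    # O(Emails × subscribers × urls).
    email = models.ForeignKey("newsletter.Email", on_delete=models.CASCADE)
    url = models.URLField(max_length=2048)
    code = models.CharField(max_length=32, unique=True)  # new per-Email tracking code

    class Meta:
        constraints = [
            models.UniqueConstraint(fields=["email", "url"], name="uniq_emaillink_email_url"),
        ]

class EmailLinkHistory(models.Model):
    # Click history split into its own table: one row per click,
    # referencing the (now small and stable) link row.
    link = models.ForeignKey(EmailLink, on_delete=models.CASCADE, related_name="history")
    subscriber = models.ForeignKey("newsletter.Subscriber", on_delete=models.CASCADE)
    clicked_at = models.DateTimeField(auto_now_add=True)
```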
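
And the ack-timing fix from the second item, sketched as a Celery task; the task and model names are illustrative, while the two real knobs are Celery's acks_late and Django's ignore_conflicts:

```python
from celery import shared_task

from newsletter.models import EmailLog  # hypothetical import path

@shared_task(acks_late=True)  # ack only after the task body returns
def create_pending_rows(email_id, subscriber_ids):
    # If the worker dies mid-insert, the still-unacked message is
    # redelivered. ignore_conflicts makes the retry idempotent: rows
    # written before the crash collide with the (Email, subscriber)
    # unique constraint and are skipped instead of duplicated.
    EmailLog.objects.bulk_create(
        [
            EmailLog(email_id=email_id, subscriber_id=sid, status="pending")
            for sid in subscriber_ids
        ],
        ignore_conflicts=True,
    )
```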
Impact
A system that streams hundreds of thousands of emails without two workers ever grabbing the same row, lets indexes be added without pausing sends, and can normalize internal tracking schemas without breaking URLs already living in shipped mail. The most satisfying part is that “send concurrency” and “operational-change concurrency” are addressed with the same toolbox — outbox, online DDL, and an explicit legacy-mapping layer — applied consistently rather than as one-off fixes.