Archiving Service

An in-house archiving service designed around three intertwined axes — Korean search quality, permission consistency, and metadata-change propagation

May 2024 - Sep 2024 • 4 months

Tech Stack

Python, Django, Celery, Elasticsearch, PostgreSQL, Redis, NGINX, Docker, NCP

Overview

The backend of an in-house archiving service that consolidates project materials, files, and customer information that used to be scattered across different internal tools. Django + DRF own the domain, PostgreSQL holds metadata, Elasticsearch holds the search index, and Naver Object Storage holds the original files. A tree-shaped file repository sits inside per-project / per-sales workspaces, with unified search layered on top.

Search couldn’t be plain keyword matching: Korean morphological decomposition and partial-match had to coexist in one index, per-user / per-group permissions had to apply consistently to every result, and when a parent object’s metadata (project or sales name) changed, every descendant document’s index had to follow. Those three axes shaped the design.

Tech Stack

Backend: Python, Django, Django REST Framework, Celery, Elasticsearch (Nori), PostgreSQL, Redis, NGINX
Infra: Docker, Naver Cloud Platform (NCP), Naver Object Storage

My Role

One of three backend engineers on a team of PM 1 / FE 1 / BE 3. I owned the file repository, search, CRM, and auth/permission domains — the index mapping and analyzer design, the ResourceShare permission model, the signal-driven indexing pipeline, and search-query tuning all sat with me.

Key Contributions

Dual analyzer that handles Korean morphology and partial match in one index.
- Body text is indexed by two analyzers in parallel — one for Latin / numbers, one for Korean morphology (Nori) — while short identifier fields like title, tags, and owner use 1–10 character n-grams. Applying n-gram to Korean body text would bloat the inverted index and break term-frequency statistics, so the roles are split explicitly: morphology for body, n-gram for identifiers.
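The split between morphology for body text and n-grams for identifier fields could be expressed roughly as the index body below. This is a minimal sketch, not the project's actual mapping: the analyzer names (`korean_morph`, `id_partial`), the `body.ko` sub-field, and the choice of the `standard` analyzer for the Latin / number side are all assumptions, and `nori_tokenizer` requires the Elasticsearch Nori analysis plugin to be installed.

```python
# Hypothetical index body; every analyzer and sub-field name here is illustrative.
INDEX_BODY = {
    "settings": {
        "index": {
            # Elasticsearch rejects ngram tokenizers whose max_gram - min_gram
            # exceeds this setting (default 1), so a 1-10 range needs 9.
            "max_ngram_diff": 9,
        },
        "analysis": {
            "tokenizer": {
                "id_ngram": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                },
            },
            "analyzer": {
                # Korean morphological analysis (Nori plugin).
                "korean_morph": {"type": "custom", "tokenizer": "nori_tokenizer"},
                # Partial match for short identifier-like fields.
                "id_partial": {
                    "type": "custom",
                    "tokenizer": "id_ngram",
                    "filter": ["lowercase"],
                },
            },
        },
    },
    "mappings": {
        "properties": {
            # Body is analyzed twice: once for Latin / numbers, once with Nori.
            "body": {
                "type": "text",
                "analyzer": "standard",
                "fields": {"ko": {"type": "text", "analyzer": "korean_morph"}},
            },
            # Identifier fields use n-grams only.
            "title": {"type": "text", "analyzer": "id_partial"},
            "tags": {"type": "text", "analyzer": "id_partial"},
            "owner": {"type": "text", "analyzer": "id_partial"},
        }
    },
}
```

Keeping the n-gram analyzer off `body` is what limits index growth: only the short fields pay the cost of emitting every 1 to 10 character substring.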
Capping the score so long bodies don’t dominate short titles.
- With default BM25 scoring, long bodies tended to outrank short titles: a 10,000-character body that repeats a term could beat a short exact-match title. I cap the body score with a log function to suppress length-driven score inflation, and boost title / tags / owner / body at 10 / 8 / 5 / 1. The field scores are combined so that the best-matching field drives the result, while each secondary field adds only 30% of its own score.
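The arithmetic behind the cap and the 30% contribution can be worked through in plain Python. This is a sketch under assumptions: the cap uses the expression quoted in the Troubleshooting section, `log` is taken to be the natural logarithm, and the function names are hypothetical. The combination rule is the standard dis_max formula (best score plus `tie_breaker` times the sum of the others).

```python
import math


def capped_body_score(score: float) -> float:
    """Cap a raw body score with min(score, log(score + 1) * 3).

    Assumes natural log. Low scores pass through unchanged; high scores
    grow only logarithmically.
    """
    return min(score, math.log(score + 1) * 3)


def dis_max_score(field_scores: list[float], tie_breaker: float = 0.3) -> float:
    """Best field score plus tie_breaker times the sum of the other fields."""
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)


if __name__ == "__main__":
    for raw in (1.0, 5.0, 40.0):
        print(raw, round(capped_body_score(raw), 2))
    print(dis_max_score([10.0, 2.0, 0.0]))
```

Under these assumptions a raw body score of 1.0 is left alone, while a raw score of 40.0 is reduced to roughly 11.1, so a body that accumulates term frequency can no longer outgrow a boosted title match without bound.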
PostgreSQL as source of truth, Elasticsearch as a denormalized search index — a read/write split.
- Mutable domain state like permissions stays in PostgreSQL as the source of truth; ES is a separate index built for search speed. Accessible document IDs are computed in PostgreSQL and passed to ES as an allow-list, so ES never carries permission logic itself. Embedding permissions in the ES mapping would force a reindex on every membership change, while ID-list filters are cached very cheaply inside ES — and keeping the permission decision in one place makes it easy to reason about. Search engine handles indexing and ranking; application handles authorization — the two roles are separated explicitly.
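Passing the allow-list to Elasticsearch might look like the sketch below. It assumes that each document's `_id` is the PostgreSQL primary key rendered as a string, and `build_search_body` is a hypothetical helper name. The `ids` query and the `bool` filter clause are standard Elasticsearch query DSL.

```python
from typing import Iterable


def build_search_body(text_query: dict, allowed_ids: Iterable[int]) -> dict:
    """Wrap a text query so that only allow-listed documents can match.

    The allow-list is computed by the application (from PostgreSQL); the
    filter clause does not affect scoring.
    """
    id_values = [str(i) for i in sorted(set(allowed_ids))]
    return {
        "query": {
            "bool": {
                "must": [text_query],
                "filter": [{"ids": {"values": id_values}}],
            }
        }
    }


if __name__ == "__main__":
    body = build_search_body({"match": {"body": "contract"}}, [42, 7, 42])
    print(body)
```

Because the filter only ever contains IDs, a membership change alters the list sent with the next query and nothing in the index needs to be rewritten.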
Index synchronization pipeline that follows changes on parent objects.
- The file index document carries a denormalized path that includes its parent’s name, so when a Sales is renamed every File underneath it has to be re-indexed too. A cascade re-index is only triggered when the changed field actually affects the index (the name), and indexing runs in two steps — delete + re-index — to avoid version conflicts. Bulk re-indexes work through the data incrementally within a memory cap and continue past per-document failures rather than aborting the run.
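The cascade logic described above can be sketched independently of Django and Elasticsearch. Everything named here is an assumption made for illustration: `INDEX_AFFECTING_FIELDS`, the function names, the batch size, and the `delete_doc` / `index_doc` callables, which stand in for whatever actually talks to the index.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

# Assumption: only the parent's name is denormalized into child documents.
INDEX_AFFECTING_FIELDS = {"name"}


def needs_cascade(changed_fields: Iterable[str]) -> bool:
    """Trigger a cascade only if a changed field is part of the index."""
    return bool(INDEX_AFFECTING_FIELDS & set(changed_fields))


def chunked(ids: Iterable[int], size: int) -> Iterator[list[int]]:
    """Yield lists of at most `size` IDs so memory use stays bounded."""
    iterator = iter(ids)
    while batch := list(islice(iterator, size)):
        yield batch


def reindex_descendants(
    file_ids: Iterable[int],
    delete_doc: Callable[[int], None],
    index_doc: Callable[[int], None],
    batch_size: int = 500,
) -> list[int]:
    """Delete then re-index each file; collect failures instead of aborting."""
    failed: list[int] = []
    for batch in chunked(file_ids, batch_size):
        for file_id in batch:
            try:
                delete_doc(file_id)  # step 1: drop the stale document
                index_doc(file_id)   # step 2: write the fresh one
            except Exception:
                failed.append(file_id)  # keep going; report at the end
    return failed
```

Returning the failed IDs rather than raising lets the caller decide whether to retry them, for example from a follow-up Celery task.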

Troubleshooting

Search and detail-API disagreeing on what “readable” means — a CQRS boundary inconsistency.
- Problem: The search index’s permission filter only let can_read=True through, but the ResourceShare permission model treated both can_read=True and can_edit=True as read-permitted. Result: a document could be invisible in search yet still reachable by hitting /api/log/{id} directly — a silent inconsistency where the read replica (ES) and the write source (Postgres) were interpreting the same domain rule differently. Classic CQRS-boundary sync failure.
- Solution: Consolidated the permission decision into a single function (check_user_permission) that both sides must call, expressing the rule as one OR combination: Q(can_read=True) | Q(can_edit=True) | Q(is_public=True). ES only ever receives a list of IDs; the meaning of “readable” is always decided by the application, so “visible in search” and “reachable by detail API” are now derived from the same call, which makes their agreement an enforced invariant.
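The invariant can be illustrated without Django. The sketch below is not the project's check_user_permission: `Share`, `is_readable`, `searchable_ids`, and `can_open_detail` are hypothetical stand-ins, and the only thing taken from the text is the rule itself (read, edit, or public each grant read access). The point is that both paths call one predicate.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Share:
    """Hypothetical stand-in for one ResourceShare row."""
    resource_id: int
    can_read: bool = False
    can_edit: bool = False
    is_public: bool = False


def is_readable(share: Share) -> bool:
    """Single definition of 'readable': read OR edit OR public."""
    return share.can_read or share.can_edit or share.is_public


def searchable_ids(shares: Iterable[Share]) -> set[int]:
    """IDs that would be handed to the search index as the allow-list."""
    return {s.resource_id for s in shares if is_readable(s)}


def can_open_detail(resource_id: int, shares: Iterable[Share]) -> bool:
    """Detail-API check, built on the same predicate as search."""
    return any(s.resource_id == resource_id and is_readable(s) for s in shares)


if __name__ == "__main__":
    shares = [Share(1, can_read=True), Share(2, can_edit=True), Share(3)]
    print(searchable_ids(shares))
    print(can_open_detail(2, shares))
```

With the original bug, the edit-only share on resource 2 would have been reachable through the detail path but missing from search; here both answers come from `is_readable`, so they cannot diverge.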
Short exact-matches losing to long bodies under BM25 length bias.
- Problem: A recurring complaint was “I searched for a project name and an unrelated document with the same word repeated several times in its body shows up above it.” Inspecting the score distribution confirmed that, with default BM25 settings, long documents often won on accumulated term frequency despite length normalization, and boost adjustments alone were not enough to compensate.
- Solution: Capping the body score with function_score + script_score (Math.min(score, log(score + 1) * 3)) suppresses length-driven inflation for high scores while leaving low scores unchanged, and layering dis_max over title / tags / owner / body with tie_breaker = 0.3 lets the best-matching field drive the result. Together these brought ranking back to what operators expected.
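One way the two pieces could be composed into a single request body is sketched below. The exact nesting used in the project is not stated, so this layout is an assumption, as are the function name `build_ranked_query` and the use of plain `match` clauses. The boosts follow the 10 / 8 / 5 / 1 weighting from Key Contributions, and `_score` and `Math.log` are what the Painless script sees inside `script_score`.

```python
# Cap expression from the text, written against Painless' `_score` variable.
BODY_CAP_SCRIPT = "Math.min(_score, Math.log(_score + 1) * 3)"


def build_ranked_query(text: str) -> dict:
    """dis_max over four fields; only the body clause is score-capped."""
    capped_body = {
        "function_score": {
            "query": {"match": {"body": {"query": text}}},
            "script_score": {"script": {"source": BODY_CAP_SCRIPT}},
            # Use the script result instead of multiplying it into _score.
            "boost_mode": "replace",
        }
    }
    return {
        "query": {
            "dis_max": {
                "tie_breaker": 0.3,
                "queries": [
                    {"match": {"title": {"query": text, "boost": 10}}},
                    {"match": {"tags": {"query": text, "boost": 8}}},
                    {"match": {"owner": {"query": text, "boost": 5}}},
                    capped_body,  # effective weight 1 (no extra boost)
                ],
            }
        }
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_ranked_query("archive"), indent=2))
```

Setting `boost_mode` to `replace` matters in this sketch: the default mode multiplies the function result by the original score, which would reintroduce the inflation the cap is meant to remove.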

Impact

A system where search quality, permission consistency, and metadata-change propagation don’t trade off against each other — index mapping, query shape, permission boundary, and the signal pipeline are designed as one piece. Even though this was an internship-period project, the most valuable part was making an explicit decision about where “search engine responsibility” ends and “application responsibility” begins, and laying Korean search quality on top of that boundary.
