Template Suggestions with Embeddings — A Day with Korean Synonyms
The "save as Template?" prompt used to require typing the exact same title twice. Making it catch semantically similar ones meant working through M0 tier limits, dimension unification, single-call reuse, and the model's blind spots in Korean.
When you create the same task title twice in Fecit, a banner asks “Want to save this as a Template?” The idea: once is incidental, twice is intent.
Problem was, “the same title” meant exactly the same. Locally we tracked counts using Levenshtein distance with a 0.8 threshold — high enough that only typos and punctuation differences counted as matches. “Going for a run” and “Jogging” were two separate counters. To the user, they’re the same activity.
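To make the gap concrete, here is a minimal sketch of that kind of local check. The normalization (dividing by the longer title's length) and the lowercasing are assumptions for illustration, not necessarily what the app did; only the 0.8 threshold comes from the description above.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


THRESHOLD = 0.8  # the threshold mentioned above


def counts_as_same(a: str, b: str) -> bool:
    return title_similarity(a.lower(), b.lower()) >= THRESHOLD
```

With this sketch, "Going for a run" and "Going for a run!" count as the same title, while "Going for a run" and "Jogging" do not: the length difference alone puts their similarity below 0.5.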
Catching the semantic match needed embeddings. Fecit was already using OpenAI embeddings elsewhere (marketplace recommendations, overview search), so this wasn’t new infrastructure — just pointing the existing tool at task titles. But once I started, every step had a decision sitting on top of it.
The Atlas Index Limit
First wall: MongoDB Atlas. Fecit runs on the M0 free tier, which caps Atlas Search and Vector Search indexes combined at 3 per cluster. Two were already taken (overview and product recommendations), leaving exactly one slot for tasks.
The plan was to use that slot for task records and handle the “is there a similar template already?” check with in-memory cosine. For a regular user with maybe a hundred templates, scanning all of them in Python takes a few milliseconds — fast enough.
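The in-memory check can be sketched roughly as follows. The function names and the `(title, vector)` shape are hypothetical; the 0.85 default is the cosine threshold that comes up later in this post.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; refuses vectors of different dimensions."""
    if len(a) != len(b):
        raise ValueError("dimension mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar_template(
    query: Sequence[float],
    templates: List[Tuple[str, Sequence[float]]],
    threshold: float = 0.85,
) -> Optional[str]:
    """Linear scan: title of the best match at or above the threshold, else None."""
    best_title, best_score = None, threshold
    for title, vector in templates:
        score = cosine(query, vector)
        if score >= best_score:
            best_title, best_score = title, score
    return best_title
```

A linear scan like this is fine when the template list is a user's own hundred or so entries, since all the work is a few hundred dot products over vectors that are already in memory.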
What broke that plan was a single system user. Fecit ships with a curated library of ~80,000 shared templates owned by an internal account. In-memory cosine for that means pulling 80,000 × 2KB = 160MB across the wire on every check, and that figure assumes 512-dimension vectors; at 1536 dimensions it would be roughly three times as much. Network blows up.
So I upgraded to the Flex tier — $8 base, capped at $30/month. Vector Search index limit jumps to 10. Comfortable margin for both task records and templates getting their own indexes, plus room for whatever comes next.
Dimension Unification
Next decision: embedding dimensions. text-embedding-3-small returns 1536 dimensions by default, but the model is trained with Matryoshka representation learning, so you can truncate to fewer dimensions with only a small quality loss.
| | 1536 dimensions | 512 dimensions |
|---|---|---|
| Vector size | 6KB | 2KB |
| MTEB score | 62.3% | 61.6% |
| API cost | same | same |
OpenAI charges per input token, not output dimension, so dimension reduction doesn’t save on the API. The savings are entirely on storage and search latency. For short task titles, 512 is plenty.
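The size figures follow from simple arithmetic, sketched below under the assumption that each component is stored as a 4-byte float (which is what the 6KB and 2KB numbers imply). The truncation helper shows the manual version of shortening a Matryoshka-style embedding: keep the leading components, then rescale to unit length. In practice the OpenAI embeddings endpoint accepts a `dimensions` parameter for the text-embedding-3 models, so the manual step is only needed for vectors you already have. All function names here are illustrative.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], dims: int = 512) -> List[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if dims > len(vector):
        raise ValueError("cannot truncate to more dimensions than the vector has")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


def vector_bytes(dims: int, bytes_per_float: int = 4) -> int:
    """Raw size of one embedding, assuming 4-byte floats."""
    return dims * bytes_per_float


def payload_megabytes(count: int, dims: int) -> float:
    """Size of `count` embeddings in decimal megabytes."""
    return count * vector_bytes(dims) / 1_000_000
```

For the 80,000-template library mentioned earlier, `payload_megabytes(80_000, 512)` is about 164 MB and `payload_megabytes(80_000, 1536)` is about 492 MB, so the dimension choice is a 3x difference in anything that has to store or move the vectors.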
The catch: every existing embedding in the system was 1536. You can’t compute cosine between vectors of different dimensions, so you can’t half-migrate. Either the whole system stays at 1536, or everything moves to 512. I went with 512 and wrote a migration script — re-embedded all 7,200 overviews and the lone product (under ten cents in OpenAI cost), then cleared the embedding fields on achievers and tasks so they’d repopulate naturally as users created new records.
One Call, Two Uses
Reading the record creation route, I noticed it was already calling OpenAI. Each new record’s title and description got embedded, and the result was blended into the user’s accumulated “interest vector” (achiever.embedding) — that vector feeds marketplace recommendations.
If I added a separate embedding call for the “similar template?” check, every record creation would hit OpenAI twice. Cost is trivial, but the duplication bothered me.
The fix was to change the interface. The new helper (get_or_compute_title_embedding) takes a title and returns an embedding. The achiever update function got refactored to accept a precomputed embedding instead of generating its own. Description gets dropped from the achiever signal — most users leave it blank anyway, so the loss is small. One OpenAI call, two consumers.
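The shape of that refactor, as a hedged sketch: the embedding is computed once and handed to both consumers. The `embed` callable stands in for whatever produces the vector, and the blending rule (an exponential moving average with weight 0.1) is an assumption for illustration, since the exact blend is not spelled out here.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


def blend_interest_vector(
    current: Optional[Vector], new: Vector, weight: float = 0.1
) -> Vector:
    """Hypothetical blend: move the interest vector a little toward the new embedding."""
    if current is None:
        return list(new)
    return [(1 - weight) * c + weight * n for c, n in zip(current, new)]


def handle_record_created(
    title: str, achiever: Dict[str, Optional[Vector]], embed: Callable[[str], Vector]
) -> Vector:
    """Embed the title once, update the achiever, and return the vector for reuse."""
    embedding = embed(title)  # the only embedding call in this request
    achiever["embedding"] = blend_interest_vector(achiever.get("embedding"), embedding)
    return embedding  # the caller reuses this for the similarity check
```

The point of the interface change is visible in the signature: the achiever update no longer knows how to call OpenAI, it only knows what to do with a vector someone else already computed.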
“Cache” vs “History”
I started by calling the embedding store a cache with a 30-day TTL. Then I sat with it for a minute and realized that’s the wrong frame.
A cache assumes the underlying source can change, so entries can go stale. HTTP response caches work that way. But embeddings are pure functions of input — same string + same model = same vector, forever. There’s no staleness to expire.
So I dropped the TTL entirely. Renamed the collection embedding_history_collection and let entries accumulate indefinitely. Common titles (“workout”, “meeting”, “daily standup”) get hit again and again, and the OpenAI call rate trends down over time. An asset that strengthens with use, not a transient store.
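A minimal in-memory stand-in for that store might look like this. The class, the key scheme (a hash of model name plus title), and the whitespace stripping are assumptions; only the helper name `get_or_compute_title_embedding` and the no-expiry behavior come from the description above. Keying on the model name reflects the "same string + same model" condition: a model or dimension change would need new entries rather than reuse of old ones.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class EmbeddingHistory:
    """In-memory stand-in for embedding_history_collection; entries never expire."""

    def __init__(self, embed: Callable[[str], Vector], model: str = "text-embedding-3-small"):
        self._embed = embed
        self._model = model
        self._entries: Dict[str, Vector] = {}
        self.api_calls = 0  # how many times the underlying embed function ran

    def _key(self, title: str) -> str:
        raw = f"{self._model}\n{title.strip()}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get_or_compute_title_embedding(self, title: str) -> Vector:
        key = self._key(title)
        if key not in self._entries:
            self._entries[key] = self._embed(title.strip())
            self.api_calls += 1
        return self._entries[key]
```

Looking up "workout" twice and "meeting" once leaves `api_calls` at 2, and the second "workout" never leaves the process.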
Bundling Into the Response
Last decision: how to surface the signals to the client. Two pieces of information were needed — how many similar records the user has, and whether a similar template already exists. Could’ve been two endpoints.
But both queries use the same query vector: the embedding of the just-created record’s title. Splitting them across endpoints means the server has to produce that embedding once per endpoint, and the client makes two extra round-trips for the suggestion check.
So both signals ride on the create-record response itself.
const taskRecord = await createTaskRecord({...});
if (taskRecord.similarCount >= 1 && !taskRecord.hasSimilarTemplate) {
  setSaveSuggestion({...}); // banner
}
Server side: insert the record → embed the title → run two $vectorSearch queries in parallel via asyncio.gather (one against task_record_vector_index, one against task_template_vector_index) → attach similar_count and has_similar_template to the response. One round-trip from the client, one OpenAI call, one set of network bytes. The mobile side stripped out the local TitleCounterStorage and the separate hasSimilarTemplate call, and just reads two fields off the response.
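The server-side flow can be sketched like this. The two search callables are stand-ins for running an aggregation against each collection, so the example runs without a database. The pipeline uses the standard `$vectorSearch` stage fields, but the `embedding` path, the `numCandidates` and `limit` values, the score cutoff, and the exclusion of the just-inserted record are all assumptions. Per-user filtering is left out for brevity, and how the returned score maps to raw cosine depends on how the index is configured.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

SCORE_THRESHOLD = 0.85  # assumed cutoff applied to the search score

Pipeline = List[Dict[str, Any]]
Search = Callable[[Pipeline], Awaitable[List[Dict[str, Any]]]]


def vector_search_pipeline(index: str, query_vector: List[float], limit: int = 20) -> Pipeline:
    """One $vectorSearch query; the 'embedding' path is an assumed field name."""
    return [
        {"$vectorSearch": {
            "index": index,
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": limit * 10,
            "limit": limit,
        }},
        {"$project": {"title": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


async def similarity_signals(
    query_vector: List[float],
    search_records: Search,
    search_templates: Search,
    new_record_id: Any,
) -> Dict[str, Any]:
    """Run both searches concurrently and reduce them to the two response fields."""
    record_hits, template_hits = await asyncio.gather(
        search_records(vector_search_pipeline("task_record_vector_index", query_vector)),
        search_templates(vector_search_pipeline("task_template_vector_index", query_vector)),
    )
    similar = [
        hit for hit in record_hits
        if hit["score"] >= SCORE_THRESHOLD and hit["_id"] != new_record_id
    ]
    return {
        "similar_count": len(similar),
        "has_similar_template": any(h["score"] >= SCORE_THRESHOLD for h in template_hits),
    }
```

Because both searches take the same `query_vector`, nothing in this function needs to know where the embedding came from, which is what lets the single OpenAI call upstream serve both signals.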
Korean Is Hard
Once it was wired up end to end, I tried it. Threshold sat at 0.85 cosine. The first surprise was a record I typed as “느린가?” (“is it slow?”) matching against an old “살짝 느린듯?” (“feels a bit slow?”), and a template suggestion popped up. (That turned out to have nothing to do with embeddings: a regex bug in the separate keyword search, where ? was being interpreted as a quantifier. Fixed separately.)
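The regex behavior is easy to reproduce. This sketch uses Python's `re` module as a stand-in; the actual keyword search may run its pattern elsewhere, so treat the function name and setup as illustrative.

```python
import re


def keyword_matches(keyword: str, title: str, escape: bool = True) -> bool:
    """Substring-style keyword search; escaping treats the keyword as literal text."""
    pattern = re.escape(keyword) if escape else keyword
    return re.search(pattern, title) is not None


keyword = "느린가?"
title = "살짝 느린듯?"

# Unescaped, "?" makes the preceding "가" optional, so the pattern matches just "느린".
buggy = keyword_matches(keyword, title, escape=False)

# Escaped, "?" is a literal question mark and the two titles no longer match.
fixed = keyword_matches(keyword, title, escape=True)
```

Here `buggy` is `True` and `fixed` is `False`, which is the false match described above.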
The real disappointment was the actual synonyms. I measured cosines directly:
달리기 (running) ↔ 조깅 (jogging): 0.399
달리기 ↔ 뛰기 (run, verb): 0.382
달리기 ↔ 런닝 (running, loanword): 0.404
달리기 30분 ↔ 회의 30분 (meeting 30min): 0.617 ← inflated by shared "30분"
운동 30분 (workout 30min) ↔ 회의 30분 (meeting 30min): 0.699
text-embedding-3-small is weak at Korean synonym detection. Worse, short shared phrases like “30분” push cosine up between semantically unrelated titles, so there’s no clean threshold that separates “these are the same task” from “these share a unit of time”.
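Plugging the measured numbers into a quick check shows why. The grouping into "same task" and "different task" pairs is my reading of the list above; the cosines are the ones measured there.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# Cosines from the measurements above, grouped by whether the pair is really the same task.
SAME_TASK: Dict[Pair, float] = {
    ("달리기", "조깅"): 0.399,
    ("달리기", "뛰기"): 0.382,
    ("달리기", "런닝"): 0.404,
}
DIFFERENT_TASK: Dict[Pair, float] = {
    ("달리기 30분", "회의 30분"): 0.617,
    ("운동 30분", "회의 30분"): 0.699,
}


def clean_threshold_exists() -> bool:
    """A separating threshold needs every synonym pair to score above every unrelated pair."""
    return min(SAME_TASK.values()) > max(DIFFERENT_TASK.values())


def errors_at(threshold: float) -> Tuple[List[Pair], List[Pair]]:
    """Return (missed synonym pairs, spurious matches) for a given threshold."""
    missed = [pair for pair, score in SAME_TASK.items() if score < threshold]
    spurious = [pair for pair, score in DIFFERENT_TASK.items() if score >= threshold]
    return missed, spurious
```

`clean_threshold_exists()` is `False` for this data: the best synonym pair (0.404) scores below the worst unrelated pair (0.617). At 0.85 all three synonym pairs are missed; dropping the threshold to 0.38 catches them but also lets both "30분" pairs through.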
Switched the test to text-embedding-3-large — meaningful improvement on synonyms (달리기↔뛰기 went from 0.382 to 0.667), but some pairs remained weak, and English loanwords like “런닝” didn’t budge. The 6.5x cost was still trivial in absolute terms, but the improvement wasn’t dramatic enough to justify the migration cost. Decided to stay on small + 0.85 threshold for now and revisit once there’s real usage data.
Looking Back
Working through this made me accept there’s no clean solution in this space. For a language with rich morphology and loanword soup like Korean, getting semantic similarity right on short task titles is genuinely hard for general-purpose embedding models. “Doesn’t work at all” would be one thing — but “works partially, in ways that are hard to predict” is harder to operate against. Lower the threshold and false positives creep in. Raise it and real synonyms slip through.
For now, the setting leans conservative — false negatives over false positives. “Why isn’t it suggesting anything?” is a quieter complaint than “why is it suggesting these unrelated things?” A multilingual specialist model (BGE-M3 or similar) or a hybrid with token-overlap heuristics is on the table, but I want more usage data before committing to either.
The feature itself is small. The decisions stacked behind it — dimension unification, call reuse, permanent storage, response bundling — took most of a week. Small features can hide a lot of context, and losing the thread doubles the work later, so I tried to write each decision down the day I made it.