Building in Public: How EdgeAI Talks to the Cloud — Backend Infrastructure Deep Dive
· Meaningful Blog
Most AI features are a fetch call to OpenAI. Ours isn't.
When we built EdgeAI, we made a deliberate choice: no third-party AI APIs. Your data never leaves our ecosystem.
That means we had to build the entire inference pipeline ourselves — from the React frontend all the way down to a dedicated cloud droplet running an open-source LLM.
Here's how it actually works under the hood.
---
The architecture at a glance
The system is split across two servers, one database, and a static frontend:
1. React SPA — the frontend, hosted on DigitalOcean App Platform
2. Express API Server — the orchestrator, same platform, handles:
- `/api/edgeai/command` — routes user messages through the Intent Router to the LLM
- `/api/chat-history` — selective encrypt/decrypt on read and write
- `/api/edgeai/start` and `/api/edgeai/stop` — on-demand model lifecycle
3. Ollama Droplet — a dedicated DigitalOcean droplet running Llama 3.2 3B Instruct
4. MongoDB — managed database with AES-256-CTR encrypted writes for private data
Two servers, one database, zero public AI calls.
---
How the web app talks to the droplet
The React frontend sends a POST to `/api/edgeai/command` with the user's message. The Express server acts as the orchestrator — it never forwards raw user input directly to the LLM.
Instead, the `EdgeAIService` class handles everything:
Mode detection — the service supports two modes via `EDGEAI_MODE`:
- `local` — uses node-llama-cpp for development
- `ollama` — connects to a remote Ollama server for production
Ollama connection — in production, the service connects to a dedicated DigitalOcean droplet running Ollama over HTTP. Authentication uses Basic Auth via the `OLLAMA_AUTH` env var, and the connection is verified on startup by hitting `/api/tags` to confirm the model is loaded.
Model selection — we run `llama3.2:3b` (Llama 3.2 3B Instruct). Small enough for a 4GB/2vCPU droplet, smart enough for intent classification, entity extraction, and conversational responses.
Lazy loading — the model is not auto-loaded on server start. It's explicitly initialized via the Settings page or EdgeAI UI. This keeps cold starts fast and memory usage predictable.
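The mode switch and startup handshake can be sketched roughly like this — a hedged illustration, not the actual `EdgeAIService` code; function names like `resolveBackend` and `checkOllama`, and the `OLLAMA_URL` variable, are assumptions:

```javascript
// Illustrative sketch: pick the inference backend from EDGEAI_MODE and
// verify the Ollama droplet on startup (names are assumptions, not the real code).
function basicAuthHeader(userPass) {
  // OLLAMA_AUTH is assumed to hold "user:password"; Basic Auth wants it base64-encoded.
  return "Basic " + Buffer.from(userPass).toString("base64");
}

function resolveBackend(env) {
  // EDGEAI_MODE picks the backend: local node-llama-cpp in dev, remote Ollama in prod.
  return env.EDGEAI_MODE === "ollama"
    ? { type: "ollama", baseUrl: env.OLLAMA_URL, auth: basicAuthHeader(env.OLLAMA_AUTH) }
    : { type: "local" };
}

async function checkOllama(backend) {
  // Confirm the droplet is reachable and the model is pulled by listing tags.
  const res = await fetch(`${backend.baseUrl}/api/tags`, {
    headers: { Authorization: backend.auth },
  });
  if (!res.ok) throw new Error(`Ollama unreachable: ${res.status}`);
  const { models } = await res.json();
  return models.some((m) => m.name.startsWith("llama3.2:3b"));
}
```

The startup check fails loudly instead of letting the first user request discover a cold droplet.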
---
The Intent Router — how we cut token usage by 75%
Sending the user's entire data context with every message is wasteful. Inspired by NVIDIA's prompt compression research, we built a Classify → Route → Handle pipeline.
Phase 1: Intent Classification
Two-tier classification — fast regex first, tiny LLM fallback for ambiguous inputs.
Tier 1 — Regex handles ~60% of messages with near-zero latency. Greeting patterns catch chitchat. Action patterns like "add connection" or "create event" catch app actions. No LLM call needed.
Tier 2 — LLM uses a minimal ~80 token prompt with `maxTokens: 16`. The model picks one of four categories: `general`, `app_query`, `app_action`, or `chitchat`.
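In code, the two tiers look something like this — an illustrative sketch, not the production router; the regex patterns and the `llm.complete` interface are assumptions:

```javascript
// Tier 1: cheap regex patterns resolve the common cases with no model call.
const GREETING = /^(hi|hello|hey|thanks|good (morning|evening))\b/i;
const ACTION = /\b(add|create|schedule|delete|remove)\b.*\b(connection|event|journal|note)\b/i;
const QUERY = /\b(who|when|what|show|list|reconnect|stale)\b/i;

async function classifyIntent(message, llm) {
  if (GREETING.test(message)) return "chitchat";
  if (ACTION.test(message)) return "app_action";
  if (QUERY.test(message)) return "app_query";
  // Tier 2: tiny LLM prompt (~80 tokens, maxTokens: 16) that must answer
  // with exactly one category name.
  const answer = await llm.complete(
    `Classify as one of: general, app_query, app_action, chitchat.\nMessage: "${message}"\nCategory:`,
    { maxTokens: 16 }
  );
  const category = answer.trim().toLowerCase();
  return ["general", "app_query", "app_action", "chitchat"].includes(category)
    ? category
    : "general"; // safe default when the model free-styles
}
```

Validating the LLM's answer against the known category list matters: a 3B model will occasionally return prose instead of a label.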
Phase 2: Smart Context Selection
Each intent sub-type gets only the data it needs:
- Stale connections → top 10 sorted by days since last contact
- Journal → filtered entries matching mentions or search terms
- Calendar → upcoming + recent events only
- Connections → category-grouped, name-filtered, capped at 30
This drops context from ~2,500 tokens to ~200–400 tokens per request.
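The stale-connections slice, for instance, reduces to a sort-and-cap — a sketch with assumed field names (`lastContact` on each connection):

```javascript
// Illustrative context builder: only the ten most-neglected connections
// make it into the prompt, rendered as compact one-line facts.
function buildStaleContext(connections, now = Date.now()) {
  const DAY = 24 * 60 * 60 * 1000;
  return connections
    .map((c) => ({
      name: c.name,
      daysSince: Math.floor((now - new Date(c.lastContact).getTime()) / DAY),
    }))
    .sort((a, b) => b.daysSince - a.daysSince) // most stale first
    .slice(0, 10)
    .map((c) => `${c.name}: ${c.daysSince} days since last contact`)
    .join("\n");
}
```

Pre-digesting the data into terse lines is what keeps this slice in the ~200-token range even with ten entries.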
Phase 3: Specialized Handlers
- Chitchat → deterministic template responses — zero LLM calls, zero tokens
- General → no app data injected — the model answers from training knowledge
- App Query → relevant data slice + memory context + last 6 messages
- App Action → extraction-only prompts (~110 tokens) — returns structured JSON
Result: ~1,400 tokens per request (was ~3,000). On a 3B model with 4GB RAM, that's the difference between responsive and unusable.
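The routing step itself is just a dispatch table — a hedged sketch, with handler names and the `deps` shape as assumptions:

```javascript
// Illustrative handler dispatch: chitchat never touches the model.
const CHITCHAT_TEMPLATES = [
  "Hey! What can I help you with today?",
  "Hi there — ask me about your connections, journal, or calendar.",
];

const handlers = {
  chitchat: async () =>
    CHITCHAT_TEMPLATES[Math.floor(Math.random() * CHITCHAT_TEMPLATES.length)],
  general: async (msg, deps) => deps.llm.complete(msg), // no app data injected
  app_query: async (msg, deps) => deps.llm.complete(deps.context + "\n" + msg),
  app_action: async (msg, deps) => JSON.parse(await deps.llm.complete(msg)), // extraction-only prompt
};

async function handleIntent(intent, message, deps) {
  return (handlers[intent] || handlers.general)(message, deps);
}
```

Falling back to the `general` handler for unrecognized intents keeps a misclassification from crashing the request.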
---
MongoDB — where everything lives
All persistent data lives in MongoDB: connections, journals, calendar events, voice notes, chat history, and learned knowledge.
But not all data is stored the same way.
Selective encryption — private data gets AES-256
We use `aes-256-ctr` with a server-side key derived via SHA-256 from `ENCRYPTION_KEY`. Every encrypted field stores an `{ iv, content }` pair — a random 16-byte initialization vector and the hex-encoded ciphertext.
What gets encrypted:
| Data | Encrypted? | Why |
|---|---|---|
| Chat messages (app_query, app_action) | ✅ AES-256 | Contains personal relationship data |
| Chat messages (general, chitchat) | ❌ Plain text | No personal data, cacheable |
| UserKnowledge (personal, relationship) | ✅ AES-256 | Names, relationship details |
| UserKnowledge (preference, goal, habit) | ❌ Plain text | Searchable, no PII |
| Session summaries | ✅ AES-256 | Cross-session context |
The decision is made at write time based on the intent type. The controller checks if the message's `intentType` is `app_query` or `app_action` — if so, the content is encrypted before it hits MongoDB, and the plain text field is cleared.
On read, the reverse happens: encrypted messages are decrypted server-side, and the `encryptedContent` field is stripped before sending to the frontend. The client never sees the ciphertext.
Deduplication with SHA-256
Every fact extracted by the knowledge system gets a `factHash` — a SHA-256 hash of the lowercase, trimmed fact text. A compound unique index on `(userId, factHash)` prevents duplicate facts.
This works for both plain and encrypted facts because the hash is computed before encryption.
---
The 3-layer memory system — how it persists
We covered the architecture in our previous post. Here's the database side:
- Layer 0 (Ephemeral) — conversation history in RAM, purged per session
- Layer 1 (Short-term) — `UserKnowledge` collection with `verified: false` — MongoDB TTL index auto-expires facts not used in 90 days
- Layer 2 (Long-term) — `UserKnowledge` collection with `verified: true` — no TTL, permanent until the user deletes them
Promotion logic — after every ~50 conversations, facts with `confidence >= 0.9` and `relevanceScore >= 3` are automatically promoted from Layer 1 to Layer 2.
Stale flag — instead of deleting incorrect facts, we set `stale: true`. The fact is excluded from retrieval but kept for audit. If the user re-confirms it later, it's un-staled.
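The layer rules above reduce to a couple of predicates — an illustrative sketch with the thresholds from the description; field names and the TTL index spec are assumptions about the schema:

```javascript
// Layer 1 rows (verified: false) expire via a TTL index on last use; one
// way to scope the TTL to Layer 1 in recent MongoDB versions is a partial
// TTL index, e.g.:
// db.userknowledges.createIndex(
//   { lastUsedAt: 1 },
//   { expireAfterSeconds: 90 * 24 * 3600, partialFilterExpression: { verified: false } }
// );

function shouldPromote(fact) {
  // Run every ~50 conversations: high-confidence, frequently used facts
  // graduate to Layer 2 (verified: true, no TTL). Stale facts never promote.
  return !fact.verified && !fact.stale && fact.confidence >= 0.9 && fact.relevanceScore >= 3;
}

function promote(facts) {
  return facts.map((f) => (shouldPromote(f) ? { ...f, verified: true } : f));
}
```

Keeping the promotion rule as a pure function makes the threshold tuning (0.9 / 3) trivially testable outside the database.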
---
The request lifecycle — end to end
Here's what happens when you type "Who should I reconnect with?" in EdgeAI:
1. React sends POST `/api/edgeai/command` with the message and conversation ID
2. Auth middleware validates the JWT and extracts `userId`
3. Controller loads the user's connections, journals, and events from MongoDB
4. EdgeAI Service calls `parseCommand()` — the intent router kicks in
5. Classify — regex matches "reconnect" → intent is `app_query/stale` (zero LLM calls)
6. Context — `buildStaleContext()` selects top 10 stale connections (~200 tokens)
7. Memory — `getUserKnowledge()` retrieves top 6 learned facts, decrypts personal ones, wraps them in memory tags
8. LLM — prompt sent to the Ollama droplet via HTTP, ~430 tokens total
9. Response — streamed back to the controller
10. Knowledge extraction — async background task extracts new facts from the exchange
11. Chat history — message saved to MongoDB (encrypted, since it's an `app_query`)
12. React — renders the response with reconnection suggestions
Total latency: ~2–4 seconds on a 4GB droplet. Acceptable for a conversational UI.
---
What we'd do differently
- WebSocket instead of SSE — bidirectional real-time streaming (the stream endpoint currently uses server-sent events)
- Connection pooling to the Ollama droplet — right now each request is a fresh HTTP call
- Field-level encryption with MongoDB CSFLE — instead of application-level encrypt/decrypt, which would enable encrypted queries
But for a team of one shipping fast, this architecture handles production traffic, keeps data private, and runs on a $24/month droplet.
---
The stack
- Frontend — React SPA on DigitalOcean App Platform
- API — Express.js with JWT authentication
- AI — Ollama + Llama 3.2 3B Instruct on a dedicated droplet
- Database — MongoDB (DigitalOcean Managed)
- Encryption — AES-256-CTR via Node.js crypto
- Transcription — faster-whisper (a local, optimized reimplementation of OpenAI's Whisper)
No wrappers. No third-party AI APIs. No data leaving the ecosystem.
What would you want to know about the infrastructure? Drop us a message.