Building in Public: How EdgeAI Talks to the Cloud — Backend Infrastructure Deep Dive

Meaningful Blog

Most AI features are a fetch call to OpenAI. Ours isn't.

When we built EdgeAI, we made a deliberate choice: no third-party AI APIs. Your data never leaves our ecosystem.

That means we had to build the entire inference pipeline ourselves — from the React frontend all the way down to a dedicated cloud droplet running an open-source LLM.

Here's how it actually works under the hood.

---

The architecture at a glance

The system is split across two servers and one database:

1. React SPA — the frontend, hosted on DigitalOcean App Platform

2. Express API Server — the orchestrator, same platform, handles:

  • `/api/edgeai/command` — routes user messages through the Intent Router to the LLM
  • `/api/chat-history` — selective encrypt/decrypt on read and write
  • `/api/edgeai/start` and `/api/edgeai/stop` — on-demand model lifecycle

3. Ollama Droplet — a dedicated DigitalOcean droplet running Llama 3.2 3B Instruct

4. MongoDB — managed database with AES-256-CTR encrypted writes for private data

Two servers, one database, zero public AI calls.

---

How the web app talks to the droplet

The React frontend sends a POST to `/api/edgeai/command` with the user's message. The Express server acts as the orchestrator — it never forwards raw user input directly to the LLM.
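The shape of that call can be sketched from the client side. The endpoint path comes from this post; the payload field names and the helper itself are illustrative:

```javascript
// Hypothetical request builder for the frontend call -- the endpoint path is
// real, the payload field names (message, conversationId) are assumptions.
function buildCommandRequest(message, conversationId, token) {
  return {
    url: '/api/edgeai/command',
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${token}`, // JWT, validated by the auth middleware
      },
      body: JSON.stringify({ message, conversationId }),
    },
  };
}

// Usage:
// const { url, options } = buildCommandRequest('Who should I reconnect with?', 'c123', jwt);
// const res = await fetch(url, options);
```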

Instead, the `EdgeAIService` class handles everything:

Mode detection — the service supports two modes via `EDGEAI_MODE`:

  • `local` — uses node-llama-cpp for development
  • `ollama` — connects to a remote Ollama server for production

Ollama connection — in production, the service connects to a dedicated DigitalOcean droplet running Ollama over HTTP. Authentication uses Basic Auth via the `OLLAMA_AUTH` env var, and the connection is verified on startup by hitting `/api/tags` to confirm the model is loaded.
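That startup check is simple to sketch. `/api/tags` is Ollama's real endpoint for listing available models; the env var names match the post, while the function names are assumptions:

```javascript
// Basic Auth header from the OLLAMA_AUTH env var ("user:password").
function basicAuthHeader(auth) {
  return `Basic ${Buffer.from(auth).toString('base64')}`;
}

// Sketch of the startup verification -- hits /api/tags and checks that the
// expected model is present. Requires Node 18+ for the global fetch.
async function verifyOllama(baseUrl = process.env.OLLAMA_URL, auth = process.env.OLLAMA_AUTH) {
  const res = await fetch(`${baseUrl}/api/tags`, {
    headers: { Authorization: basicAuthHeader(auth) },
  });
  if (!res.ok) throw new Error(`Ollama unreachable: ${res.status}`);
  const { models } = await res.json();
  return models.some((m) => m.name.startsWith('llama3.2:3b'));
}
```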

Model selection — we run `llama3.2:3b` (Llama 3.2 3B Instruct). Small enough for a 4GB/2vCPU droplet, smart enough for intent classification, entity extraction, and conversational responses.

Lazy loading — the model is not auto-loaded on server start. It's explicitly initialized via the Settings page or EdgeAI UI. This keeps cold starts fast and memory usage predictable.

---

The Intent Router — how we cut token usage by 75%

Sending the user's entire data context with every message is wasteful. Inspired by NVIDIA's prompt compression research, we built a Classify → Route → Handle pipeline.

Phase 1: Intent Classification

Two-tier classification — fast regex first, tiny LLM fallback for ambiguous inputs.

Tier 1 — Regex handles ~60% of messages with zero latency. Greeting patterns catch chitchat. Action patterns like "add connection" or "create event" catch app actions. No LLM call needed.

Tier 2 — LLM uses a minimal ~80 token prompt with `maxTokens: 16`. The model picks one of four categories: `general`, `app_query`, `app_action`, or `chitchat`.
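The two tiers compose like this. The four categories and the 16-token cap are from the post; the specific regex patterns and the `llm.complete` interface are illustrative:

```javascript
// Tier 1 patterns -- illustrative, not the production regexes.
const GREETING_RE = /^(hi|hey|hello|thanks|good (morning|evening))\b/i;
const ACTION_RE = /\b(add|create|schedule|delete)\b.*\b(connection|event|journal|note)\b/i;
const QUERY_RE = /\b(who|when|show|list|reconnect|stale)\b/i;

function classifyFast(message) {
  // Zero-latency path: handles the clear-cut majority of messages.
  if (GREETING_RE.test(message)) return 'chitchat';
  if (ACTION_RE.test(message)) return 'app_action';
  if (QUERY_RE.test(message)) return 'app_query';
  return null; // ambiguous -> fall through to the LLM
}

async function classifyIntent(message, llm) {
  const fast = classifyFast(message);
  if (fast) return fast;
  // Tier 2: minimal prompt, output capped at 16 tokens.
  const prompt = `Classify the user message as one of: general, app_query, app_action, chitchat.\nMessage: "${message}"\nCategory:`;
  const out = await llm.complete(prompt, { maxTokens: 16 }); // hypothetical client interface
  const category = out.trim().toLowerCase();
  return ['general', 'app_query', 'app_action', 'chitchat'].includes(category)
    ? category
    : 'general'; // safe default when the model answers off-script
}
```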

Phase 2: Smart Context Selection

Each intent sub-type gets only the data it needs:

  • Stale connections → top 10 sorted by days since last contact
  • Journal → filtered entries matching mentions or search terms
  • Calendar → upcoming + recent events only
  • Connections → category-grouped, name-filtered, capped at 30

This drops context from ~2,500 tokens to ~200–400 tokens per request.
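As one example, the stale-connection slice can be sketched like this — the top-10 cap is from the post, the record fields (`name`, `lastContactedAt`) are assumptions:

```javascript
// Sketch of the stale-connection context builder: sort by days since last
// contact, keep the top 10, emit a compact text slice (~200 tokens).
function buildStaleContext(connections, now = Date.now()) {
  const DAY = 24 * 60 * 60 * 1000;
  return connections
    .map((c) => ({
      name: c.name,
      daysSinceContact: Math.floor((now - new Date(c.lastContactedAt).getTime()) / DAY),
    }))
    .sort((a, b) => b.daysSinceContact - a.daysSinceContact)
    .slice(0, 10) // cap keeps the prompt small on a 3B model
    .map((c) => `${c.name}: ${c.daysSinceContact} days since last contact`)
    .join('\n');
}
```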

Phase 3: Specialized Handlers

  • Chitchat → deterministic template responses — zero LLM calls, zero tokens
  • General → no app data injected — the model answers from training knowledge
  • App Query → relevant data slice + memory context + last 6 messages
  • App Action → extraction-only prompts (~110 tokens) — returns structured JSON

Result: ~1,400 tokens per request (was ~3,000). On a 3B model with 4GB RAM, that's the difference between responsive and unusable.

---

MongoDB — where everything lives

All persistent data lives in MongoDB: connections, journals, calendar events, voice notes, chat history, and learned knowledge.

But not all data is stored the same way.

Selective encryption — private data gets AES-256

We use `aes-256-ctr` with a server-side key derived via SHA-256 from `ENCRYPTION_KEY`. Every encrypted field stores an `{ iv, content }` pair — a random 16-byte initialization vector and the hex-encoded ciphertext.

What gets encrypted:

| Data | Encrypted? | Why |
|---|---|---|
| Chat messages (app_query, app_action) | ✅ AES-256 | Contains personal relationship data |
| Chat messages (general, chitchat) | ❌ Plain text | No personal data, cacheable |
| UserKnowledge (personal, relationship) | ✅ AES-256 | Names, relationship details |
| UserKnowledge (preference, goal, habit) | ❌ Plain text | Searchable, no PII |
| Session summaries | ✅ AES-256 | Cross-session context |

The decision is made at write time based on the intent type. The controller checks if the message's `intentType` is `app_query` or `app_action` — if so, the content is encrypted before it hits MongoDB, and the plain text field is cleared.

On read, the reverse happens: encrypted messages are decrypted server-side, and the `encryptedContent` field is stripped before sending to the frontend. The client never sees the ciphertext.
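Both directions reduce to a pair of small pure functions. The intent names are from the post; the document shape and the injected `encrypt`/`decrypt` callbacks are assumptions made so the logic is easy to test in isolation:

```javascript
// Intents whose messages carry personal data and must be encrypted at rest.
const PRIVATE_INTENTS = new Set(['app_query', 'app_action']);

// Write path: encrypt private messages and clear the plaintext field
// before the document hits MongoDB.
function prepareForWrite(message, encrypt) {
  if (!PRIVATE_INTENTS.has(message.intentType)) return message; // stored as plain text
  return {
    ...message,
    encryptedContent: encrypt(message.content),
    content: '', // plaintext never persisted
  };
}

// Read path: decrypt server-side and strip the ciphertext field,
// so the client never sees encryptedContent.
function prepareForRead(doc, decrypt) {
  if (!doc.encryptedContent) return doc;
  const { encryptedContent, ...rest } = doc;
  return { ...rest, content: decrypt(encryptedContent) };
}
```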

Deduplication with SHA-256

Every fact extracted by the knowledge system gets a `factHash` — a SHA-256 hash of the lowercase, trimmed fact text. A compound unique index on `(userId, factHash)` prevents duplicate facts.

This works for both plain and encrypted facts because the hash is computed before encryption.

---

The 3-layer memory system — how it persists

We covered the architecture in our previous post. Here's the database side:

  • Layer 0 (Ephemeral) — conversation history in RAM, purged per session
  • Layer 1 (Short-term) — `UserKnowledge` collection with `verified: false` — MongoDB TTL index auto-expires facts not used in 90 days
  • Layer 2 (Long-term) — `UserKnowledge` collection with `verified: true` — no TTL, permanent until the user deletes them
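The Layer 1 expiry above maps naturally onto a partial TTL index, so that only unverified facts are eligible for expiry. A sketch of the index definition — the field names (`lastUsedAt`, `verified`) are assumptions:

```javascript
// TTL index spec for Layer 1: expires unverified facts 90 days after their
// lastUsedAt timestamp. The partialFilterExpression leaves verified Layer 2
// facts untouched -- they carry no TTL and never expire.
const NINETY_DAYS = 90 * 24 * 60 * 60; // expireAfterSeconds is in seconds

const layer1TtlIndex = {
  keys: { lastUsedAt: 1 },
  options: {
    expireAfterSeconds: NINETY_DAYS,
    partialFilterExpression: { verified: false },
  },
};

// Applied once, e.g. in mongosh:
// db.userknowledges.createIndex(layer1TtlIndex.keys, layer1TtlIndex.options);
```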

Promotion logic — after every ~50 conversations, facts with `confidence >= 0.9` and `relevanceScore >= 3` are automatically promoted from Layer 1 to Layer 2.
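The promotion criteria are simple enough to express directly. The thresholds and field names are from the post; the helper name is an assumption:

```javascript
// Facts eligible for promotion from Layer 1 to Layer 2: still unverified,
// high confidence, referenced often enough to matter.
function selectForPromotion(facts) {
  return facts.filter(
    (f) => !f.verified && f.confidence >= 0.9 && f.relevanceScore >= 3
  );
}

// In Mongo terms this is a single updateMany, run every ~50 conversations:
// db.userknowledges.updateMany(
//   { userId, verified: false, confidence: { $gte: 0.9 }, relevanceScore: { $gte: 3 } },
//   { $set: { verified: true } }
// );
```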

Stale flag — instead of deleting incorrect facts, we set `stale: true`. The fact is excluded from retrieval but kept for audit. If the user re-confirms it later, it's un-staled.

---

The request lifecycle — end to end

Here's what happens when you type "Who should I reconnect with?" in EdgeAI:

1. React sends POST `/api/edgeai/command` with the message and conversation ID

2. Auth middleware validates the JWT and extracts `userId`

3. Controller loads the user's connections, journals, and events from MongoDB

4. EdgeAI Service calls `parseCommand()` — the intent router kicks in

5. Classify — regex matches "reconnect" → intent is `app_query/stale` (zero LLM calls)

6. Context — `buildStaleContext()` selects top 10 stale connections (~200 tokens)

7. Memory — `getUserKnowledge()` retrieves top 6 learned facts, decrypts personal ones, wraps them in memory tags

8. LLM — prompt sent to the Ollama droplet via HTTP, ~430 tokens total

9. Response — streamed back to the controller

10. Knowledge extraction — async background task extracts new facts from the exchange

11. Chat history — message saved to MongoDB (encrypted, since it's an `app_query`)

12. React — renders the response with reconnection suggestions

Total latency: ~2–4 seconds on a 4GB droplet. Acceptable for a conversational UI.

---

What we'd do differently

  • WebSockets for the stream endpoint — true bidirectional streaming; it currently uses one-way server-sent events (SSE)
  • Connection pooling to the Ollama droplet — right now each request is a fresh HTTP call
  • Field-level encryption with MongoDB CSFLE — instead of application-level encrypt/decrypt, which would also allow querying encrypted fields

But for a team of one shipping fast, this architecture handles production traffic, keeps data private, and runs on a $24/month droplet.

---

The stack

  • Frontend — React SPA on DigitalOcean App Platform
  • API — Express.js with JWT authentication
  • AI — Ollama + Llama 3.2 3B Instruct on a dedicated droplet
  • Database — MongoDB (DigitalOcean Managed)
  • Encryption — AES-256-CTR via Node.js crypto
  • Transcription — faster-whisper (a local reimplementation of OpenAI's Whisper)

No wrappers. No third-party AI APIs. No data leaving the ecosystem.

What would you want to know about the infrastructure? Drop us a message.