Building in public: how EdgeAI remembers — 3 layers of memory


The problem with most AI assistants

They forget everything the moment you close the tab.

You tell ChatGPT your daughter's name on Monday, and by Wednesday it's gone. That's fine for generic Q&A. It's useless for a relationship management tool.

So we built EdgeAI with a 3-layer memory system.

---

The architecture

Layer 0: Ephemeral (RAM)

A 6-message sliding window. Lives only during the current conversation. Purged when the session ends.

This is your working memory — enough to keep a conversation coherent without bloating the context window.
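In code, a sliding window like this is little more than a bounded deque. Here's a minimal sketch (the class and method names are illustrative, not EdgeAI's actual implementation):

```python
from collections import deque

class EphemeralMemory:
    """Layer 0: keeps only the last N messages of the current session."""

    def __init__(self, max_messages: int = 6):
        # maxlen makes the deque drop the oldest message automatically
        self._window = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self._window.append({"role": role, "content": content})

    def context(self) -> list[dict]:
        """Messages to include in the next prompt."""
        return list(self._window)

# When the session ends, the object is simply discarded; nothing persists.
```

The `maxlen` argument does all the work: once the seventh message arrives, the first one falls off, so the context window can never bloat.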

Layer 1: Short-term (MongoDB, 90-day TTL)

After every meaningful exchange, EdgeAI extracts up to three key facts, each tagged with a confidence score and a category.

Example: "Daughter is Natalia" (confidence: 0.9, category: relationship)

These facts are deduplicated via SHA-256 hashing. If you mention the same thing twice, it doesn't store it twice. Facts expire after 90 days unless promoted.
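The dedup-plus-TTL pattern looks roughly like this. Note the assumptions: the hash input (normalized fact text) and the field names are illustrative, not EdgeAI's actual schema, and the in-memory dict stands in for a MongoDB collection:

```python
import hashlib
from datetime import datetime, timedelta, timezone

def fact_key(fact_text: str) -> str:
    """Stable dedup key: SHA-256 of the normalized fact text (assumed input)."""
    normalized = " ".join(fact_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def store_fact(store: dict, fact_text: str, confidence: float, category: str) -> bool:
    """Insert a fact unless an identical one already exists. Returns True if stored."""
    key = fact_key(fact_text)
    if key in store:  # same fact mentioned twice: skip the duplicate
        return False
    store[key] = {
        "text": fact_text,
        "confidence": confidence,
        "category": category,
        # With MongoDB, a TTL index on this field expires the doc automatically:
        #   db.facts.create_index("expires_at", expireAfterSeconds=0)
        "expires_at": datetime.now(timezone.utc) + timedelta(days=90),
    }
    return True
```

The nice property of hashing normalized text is that "Daughter is Natalia" and "daughter is natalia" collapse to the same key, so casual re-mentions don't pile up as duplicates.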

Layer 2: Long-term (MongoDB, permanent)

High-confidence facts that have been referenced multiple times get promoted to permanent storage. These are the things EdgeAI should never forget.
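The promotion rule reduces to a threshold check. The specific numbers below are made up for illustration; the post doesn't state EdgeAI's actual criteria:

```python
def should_promote(fact: dict,
                   min_confidence: float = 0.8,
                   min_references: int = 3) -> bool:
    """Promote a short-term fact to permanent storage once it is both
    high-confidence and referenced often enough (illustrative thresholds)."""
    return (fact.get("confidence", 0.0) >= min_confidence
            and fact.get("reference_count", 0) >= min_references)

# Promoted facts would then move to a collection without a TTL index,
# so they never expire.
```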

You can also manually verify or correct facts — instant "unlearning" if the AI gets something wrong.

---

How it stays grounded

Every response includes invisible memory tags that ground the AI's answers in stored facts. If EdgeAI doesn't have a fact, it says so — instead of making something up.

Before: "Natalia lives in Amsterdam and works at Google" (hallucinated)

After: "Natalia is your daughter" (grounded in stored fact)

Hallucination rate dropped from ~15% to under 2%.
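One way to implement that kind of grounding is to inject stored facts into the prompt inside tags the model is told to treat as its only memory. This is a hypothetical sketch; the actual tag format EdgeAI uses isn't shown in this post:

```python
def build_grounded_prompt(question: str, facts: list[str]) -> str:
    """Wrap stored facts in tags and instruct the model to answer only from them."""
    fact_block = "\n".join(f"<memory>{f}</memory>" for f in facts)
    return (
        "Answer using ONLY the facts inside <memory> tags. "
        "If the answer is not in memory, say you don't know.\n\n"
        f"{fact_block}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    "Who is Natalia?",
    ["Natalia is the user's daughter"],
)
```

The explicit "say you don't know" instruction is what turns a hallucinated "Natalia lives in Amsterdam" into an honest "I only know she's your daughter."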

---

The infrastructure

EdgeAI runs on a dedicated DigitalOcean droplet — separate from the main app.

  • Model: Llama 3.2 3B (open-source, quantized)
  • Server: Ollama with Nginx reverse proxy and basic auth
  • Communication: Private network between app and AI server
  • Encryption: AES-256 for personal facts, plain text for general knowledge (cacheable)

The main app sends a request over the private network. Ollama processes it. The response comes back. No public internet involved.
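Over the wire, that request is a single POST to Ollama's standard `/api/generate` endpoint on the private interface. A sketch (the host address is a placeholder, and the request-building is split out so it can be inspected without a live server):

```python
import json
import urllib.request

def build_request(prompt: str) -> dict:
    """Body for Ollama's /api/generate endpoint."""
    return {
        "model": "llama3.2:3b",  # the quantized 3B model named in this post
        "prompt": prompt,
        "stream": False,         # one JSON object back instead of a token stream
    }

def ask_edgeai(prompt: str, host: str = "http://10.0.0.2:11434") -> str:
    """Send one generation request to Ollama over the private network."""
    payload = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

In production you'd also pass the basic-auth header through the Nginx proxy; that's omitted here for brevity.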

---

Token budget

On modest hardware (4GB RAM, 2 vCPUs), we keep total token usage under 1,400 per request:

  • System prompt: ~100 tokens
  • Memory context: ~150 tokens (7 facts max)
  • Conversation history: ~200 tokens
  • Data context: ~400 tokens
  • Response: ~512 tokens

Smart context selection means we don't dump all 100+ connections into the prompt. We filter by relevance first.
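That selection step amounts to scoring candidates against the query and greedily filling the token budget. Here's a crude sketch using keyword overlap and a characters-per-token heuristic; the post only says "filter by relevance", so the scoring function is an assumption:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English text."""
    return max(1, len(text) // 4)

def select_context(query: str, candidates: list[str], budget: int = 400) -> list[str]:
    """Greedy selection: rank by word overlap with the query, fill the budget."""
    query_words = set(query.lower().split())
    ranked = sorted(
        candidates,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )
    selected, used = [], 0
    for item in ranked:
        cost = estimate_tokens(item)
        if used + cost > budget:
            continue  # skip items that would blow the budget
        selected.append(item)
        used += cost
    return selected
```

An embedding model (on the roadmap below) would replace the word-overlap score with cosine similarity, but the budget-filling loop stays the same.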

---

What's next

  • User-facing UI to view and edit learned facts
  • Scheduled fact promotion jobs
  • Embedding model for semantic search (when we upgrade hardware)

The goal is simple: an AI that gets smarter the longer you use it, without ever compromising your privacy.

What would you want an AI to remember about your professional network?