S8 Knowledge Integration

Whatever you are building — a simple chatbot or a sprawling multi-agent orchestration — it rests on the same footing. Not the model you chose, not the framework, not the prompt. It rests on your data, and on whether you can account for every piece of it the model is allowed to touch.

That is the part most teams underinvest in, and it is the part that decides whether a system is a compliance asset or a compliance liability.

The foundation is data — and data is exposure

Exposing a model to data is unavoidable. A model with no access to your information is like an employee with no training: however capable it is in principle, it cannot add much value if it cannot see anything. The moment you make it useful, you have made it a participant in how your data moves around.

So the design question is not whether to give the model data. It is which data, held where, reachable by whom, and provable how. Get that right and the rest of the architecture has somewhere solid to stand.

The four sources of data a model draws on

Modern AI systems pull from four broad kinds of data store. They differ in how they are produced, how efficient they are, and — crucially — how much of a compliance problem they create.

Raw text (human-made)

The crudest form: documents, fed in more or less whole. It is simple, but inefficient and expensive. A model’s context window is finite, and every token placed in it costs money and latency, so stuffing entire documents into it wastes that budget and quickly hits a ceiling. Useful as a fallback; rarely the right primary strategy at scale.

Vector databases (human-made)

Here you convert your data ahead of time into vectors (embeddings), the numerical form in which a model represents meaning. Instead of dumping whole documents at the model, you retrieve only the passages whose vectors sit closest to the task at hand. The model gets a focused set of references rather than choking on irrelevant material. This is the backbone of retrieval-augmented generation (RAG), and for most knowledge systems it is the workhorse.
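
As a rough illustration of that retrieval step, here is a minimal sketch in Python. The three-dimensional vectors, the sample passages and the names INDEX and retrieve are made up for the example; a real system would get its vectors from an embedding model and store them in a vector database rather than a list.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical pre-computed index of (passage, embedding) pairs.
INDEX = [
    ("Termination requires 30 days written notice.", [0.9, 0.1, 0.0]),
    ("The office kitchen is cleaned on Fridays.", [0.0, 0.2, 0.9]),
    ("Either party may terminate for material breach.", [0.8, 0.3, 0.1]),
]

def retrieve(query_vector, k=2):
    """Return the k passages whose vectors are most similar to the query."""
    ranked = sorted(INDEX, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# A query about termination lands near the two contract passages,
# so the kitchen rota never reaches the model.
top = retrieve([1.0, 0.2, 0.0])
```

The compliance angle sits in the index itself: whatever is in it is retrievable, so what goes into it, and for whom, is a decision worth making explicitly.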

Graph databases (human- and model-made)

This is where it gets interesting — and where the risk profile changes. Graphs are often built by the model rather than by a person. The model reads your corpus, picks out the key entities and the relationships between them — people and the documents that mention them, clauses and the jurisdictions they bind — and assembles a map of how everything fits together.

If the vector database is all the puzzle pieces, the graph is the completed puzzle.

The power is obvious. So is the catch: this is machine-generated data. Without careful control it adds a layer of obscurity over what data actually exists and what it represents. You no longer fully authored your own dataset, and “we are not sure what is in there” is the opening line of a compliance nightmare.
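
One way to keep machine-authored graph data accountable is to store provenance on every relationship. The sketch below is illustrative only: the Edge fields, the "human" label and the model identifier "extractor-model-v1" are assumptions for the example, not a description of any particular graph database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str
    source_doc: str    # document the relationship was extracted from
    extracted_by: str  # "human" or a model identifier (hypothetical labels)

class KnowledgeGraph:
    def __init__(self):
        self.edges = []

    def add(self, edge):
        self.edges.append(edge)

    def neighbours(self, entity):
        """Every relationship that touches the given entity."""
        return [e for e in self.edges if e.subject == entity or e.obj == entity]

    def machine_authored(self):
        """Every relationship a person did not write, for review or audit."""
        return [e for e in self.edges if e.extracted_by != "human"]

graph = KnowledgeGraph()
graph.add(Edge("Clause 12", "binds", "England and Wales", "contract-001.pdf", "human"))
graph.add(Edge("Jane Doe", "mentioned_in", "contract-001.pdf", "contract-001.pdf", "extractor-model-v1"))
```

With provenance attached, "what did the model add, and from which document" becomes a query rather than a guess.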

Memory (model-made)

The most interesting source, and the largest security and compliance risk of the four. Memory is frequently created, curated, and read by the model, with no point at which a human needs to see it. It is data your system generates about its own interactions and then relies on later — and by default, nobody is watching what goes in.

Source	Made by	Efficiency	Primary risk
Raw text	Human	Low	Cost, context limits
Vector database	Human	High	Over-broad retrieval
Graph database	Human + model	High	Obscure, machine-authored data
Memory	Model	High	Invisible, unaudited model-managed state

The pattern is worth sitting with: as you move down the table you gain capability, but more of the data is authored by the model and less of it is naturally visible to you. Security, privacy and compliance have to be designed in at the foundation — and the further down this list you go, the more deliberately.

Where the risk actually lives

Provider leakage

The most capable models generally cannot be run on your own hardware, so your data is sent to a provider’s cloud servers to be processed. That is a genuine point of exposure, and it is the one people worry about first, but it is also the most tractable.

You mitigate it with two things. First, ordinary good cyber practice: encrypt everything in transit and at rest. Second, a clear Data Processing Agreement (DPA) that guarantees your data will not be used to train the provider’s models or be passed to third parties. On AWS, for example, Amazon Bedrock’s documented position is that prompts and completions are never used to train base models or shared with the underlying model providers, with encryption in transit and at rest, customer-managed keys via AWS Key Management Service (KMS), and private connectivity through AWS PrivateLink. (AWS — Bedrock security and privacy)

Pin this down contractually and technically and provider leakage stops being the scary one.

Access control — treat the model like an employee

The most useful mental model: an AI assistant is an employee’s assistant, and it should have no more privilege than the employee it works for — ideally only the minimum set of permissions it needs to do the job. Least privilege, applied to software.

The critical word is outside. All of this control must live outside the model. You cannot rely on a model to refrain from sharing information between users when it technically has access to all of it. The model has to be unable, at the infrastructure level, to reach what it should not see. Even if it is asked to fetch something off-limits, even if it writes and runs code to try, it fails, because it only holds the keys to the rooms it is allowed into.

None of that happens by default. It has to be engineered in.
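
A minimal sketch of what "outside the model" can look like in code, assuming a hypothetical document tool and an in-memory access list. The names DOCUMENT_ACL, fetch_document and make_tools_for are invented for the example. The point it illustrates is that the caller's identity is bound by the server, so nothing the model puts in its tool arguments can change whose permissions apply.

```python
class AccessDenied(Exception):
    pass

# Hypothetical data: which users may read which documents.
DOCUMENT_ACL = {
    "payroll-2024": {"alice"},
    "handbook": {"alice", "bob"},
}
DOCUMENTS = {
    "payroll-2024": "Salary bands for 2024.",
    "handbook": "Welcome to the company.",
}

def fetch_document(user_id, doc_id):
    """Permission check runs in server code, before any data is returned."""
    if user_id not in DOCUMENT_ACL.get(doc_id, set()):
        raise AccessDenied(f"{user_id} may not read {doc_id}")
    return DOCUMENTS[doc_id]

def make_tools_for(user_id):
    """Build the tool set for one session.

    user_id comes from the authenticated session, never from the model.
    The model only ever sees a function that takes doc_id.
    """
    return {"fetch_document": lambda doc_id: fetch_document(user_id, doc_id)}

tools = make_tools_for("bob")
allowed = tools["fetch_document"]("handbook")
```

In this sketch the check lives in the same process as the tool, which is enough to show the shape but not enough for a model that can execute arbitrary code. In a real deployment the same rule would be enforced by a separate service or by the platform's own identity and access controls.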

Memory control — models are stateless goldfish

Here is a common misconception worth clearing up. People assume models remember things — that their neural networks change in response to what you tell them, the way a person’s do. For the systems you will meet in modern AI-powered products, that is not how it works.¹ Models are stateless. Left to themselves, they are goldfish.

When a chat with a model like Claude or GPT feels like it remembers you, that is a neat trick rather than a property of the model. A model’s memory sits separate from the model — it is a database the model reads from and writes to when appropriate. This is exactly where the employee analogy breaks down, and it matters enormously, because it makes memory a control point you actually own.

It also raises a genuinely new set of questions:

What does it remember? — what gets written to memory, and on what basis.
What does it actually know? — how you inspect the contents rather than trusting a black box.
Can one user’s memory reach another? — you share an assistant; how do you stop it gossiping?

No one likes a gossip, and the same goes for models. Cross-user memory leakage is the failure mode that turns a clever feature into a data-protection incident, and it is entirely preventable — but only with deliberate isolation.
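
The three questions map fairly directly onto code. The sketch below assumes an in-memory store and hypothetical names (MemoryStore, audit_log); a real system would use a database, but the shape is the same: every write carries a reason, every write is logged where a human can read it, and reads are keyed strictly by user.

```python
from datetime import datetime, timezone

class MemoryStore:
    def __init__(self):
        self._items = {}     # user_id -> list of memory records
        self.audit_log = []  # every write, inspectable by humans

    def write(self, user_id, text, reason):
        """What does it remember? Each entry records why it was stored."""
        record = {
            "text": text,
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
        }
        self._items.setdefault(user_id, []).append(record)
        self.audit_log.append({"user_id": user_id, **record})

    def read(self, user_id):
        """Can one user's memory reach another? Not through this method."""
        return [r["text"] for r in self._items.get(user_id, [])]

store = MemoryStore()
store.write("alice", "Prefers summaries as bullet points", "stated preference")
alice_view = store.read("alice")
bob_view = store.read("bob")
```

The audit log answers "what does it actually know": it is an ordinary data structure you can query, not something inferred from the model's behaviour.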

Strong security lives outside the model

The throughline of all of the above is one principle: strong security lives outside the model.

Models and agents should be sandboxed and permissioned so that even if they can freely write and execute code, they cannot act beyond their granted access. Just like an employee with controlled access — wanting to see something is not the same as being able to. That is the cake: identity, permissions, encryption and isolation, all enforced where the model cannot reach them.

Prompt-level safeguards — input and output filters, system-prompt instructions, protection against prompt injection — are the icing. They have a genuine place, and you should absolutely use them, because why wouldn’t you? But they sit on top of the hard controls, not in place of them. A guardrail lowers the probability of a bad output; it does not make a bad action impossible. Asking your staff to be discreet is sensible — it is not the same as locking the filing cabinet.
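
To make the "lowers the probability" point concrete, here is a deliberately simple output filter. The regular expression and the redact_emails name are assumptions for the example, and real filters are far more sophisticated, but the limitation is the same in kind: a filter catches the patterns it was written for.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def redact_emails(text):
    """Output guardrail: mask anything that looks like an email address."""
    return EMAIL.sub("[redacted]", text)

# The obvious form is caught.
plain = redact_emails("Contact jane.doe@example.com for details.")

# A lightly obfuscated form passes straight through.
obfuscated = redact_emails("Contact jane.doe at example dot com for details.")
```

If the address was never reachable by the model in the first place, neither form can appear. That is the difference between the two layers.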

So use both. Just be clear about which layer is load-bearing.

Guardrails lower the odds of a mistake. Hard controls make the mistake impossible. Use both — but only one of them is the boundary.

How we build this at S8 Knowledge Integration

This is the foundation we design from. Our systems run on AWS in the UK (eu-west-2), which lets every one of these problems be handled simply and transparently:

Logical scoping by identity, by default — every request is scoped to the authenticated user’s identity. We key access to the subject (sub) claim carried in the user’s JSON Web Token (JWT) — the unique, signed identifier for that user — so retrieval, memory and resources are constrained to them. Crucially this is enforced server-side, before anything reaches the model, so one user’s data does not surface in another user’s session. The model cannot widen its own scope, because the scope is decided outside it.
Hard isolation where it is warranted — for workloads that demand it, we can go further and separate tenants at the infrastructure level. That is an option rather than the default: full physical separation carries real cost and operational overhead, and for most systems identity-based logical scoping is the pragmatic, defensible choice. The point is to match the isolation model to the risk, not to over-engineer every deployment.
Everything visible and auditable — memory and machine-generated data are not a black box. They are stored where you can see them, query them and account for them.
Findable and observable — you can answer “what does the system know, and why” with evidence, not assurances.

Compliance, in the end, rests on four things: control over your data, transparency about what it is, findability when you need it, and observability over what the system is doing with it. Build those into the foundation and everything you put on top inherits them.

Get the data foundations right at the start, and the ambitious system you build later has somewhere solid to stand. Get them wrong, and no amount of clever prompting will save you.

Secure data foundations: getting your AI system right from the start