While researching the last post, I came across topics that have stayed with me ever since.
It started with my journey back in time to Cyc. I wanted to understand what Lenat had actually formalized. Unfortunately, much of it has disappeared. But I was able to find the OpenCyc flatfiles that his team published in the 2000s. And there I discovered that Cyc formalized deontic logic.
That is, concepts such as duties, prohibitions, and permissions. And these as formal predicates: oughtToBe, forbiddenToBe, permittedToBe. Organized into microtheories, i.e., contextualized knowledge partitions. And a predicate that would turn out to be central: oughtToDo-WRT. “With Respect To,” referring to a specific code of conduct. An obligation that exists not absolutely, but within the framework of a named set of rules.
From there, I came across the Qualitative Reasoning Group at Northwestern University, Ken Forbus’s group. They are the last of the Mohicans, so to speak. They’ve been working for years on the Companions architecture, a system that combines formal knowledge with reasoning capabilities. And there I found Taylor Olson’s dissertation: “A Formal Theory of Norms”, Northwestern University, June 2025.
Well, the math in there is somewhat hard to digest. But anyone who starts their dissertation with a science fiction quote has already won me over:
In Isaac Asimov’s *Caves of Steel*, detective Daneel Olivaw continually probes the Three Laws of Robotics that govern robot behavior: Daneel Olivaw, “And a robot with a First Law built in could not kill a man?” Dr. Gerrigel, “Never. Unless such killing were completely accidental or unless it were necessary to save the lives of two or more men. In either case, the positronic potential built up would ruin the [robot’s] brain beyond recovery.” (p. 124)
So what if we could connect this to LLMs?
But first, I need to explain why the common answer to this problem isn’t enough. Namely: Train the models better.
What a Language Model Does
A Transformer computes a probability distribution over the next token, and the decoding step selects from it. Depending on the sampling settings, this is more or less random, but always a draw from a distribution. There is no mechanism in the Transformer that categorically excludes certain outputs.
If an LLM has “learned” not to produce certain content, it has learned that the probability of certain sequences in certain contexts should be very low. But “very low” and “zero” are two different things.
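To make the difference between “very low” and “zero” concrete, here is a minimal sketch of the final step of a language model: a softmax over scores, followed by sampling. The token names and logit values are made up for illustration and do not come from any real model.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens. Training has pushed the
# "forbidden" token far down, but only by a finite amount.
tokens = ["ok", "fine", "forbidden"]
logits = [5.0, 4.0, -10.0]

probs = softmax(logits)
forbidden_p = probs[tokens.index("forbidden")]

# Sampling draws from the whole distribution, so any token with nonzero
# probability can in principle be selected.
sampled = random.choices(tokens, weights=probs, k=1)[0]

print(f"P(forbidden) = {forbidden_p:.3e}")
```

Mathematically, the exponential in the softmax is strictly positive for every finite logit. Training can make the number tiny, but it cannot make it zero.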
Reinforcement Learning from Human Feedback (RLHF) shifts the distribution. RLHF trains the model on human preferences: a reward model learns which answers humans rate as “good,” and the language model is optimized to score well against that reward model.
Anthropic’s approach, Constitutional AI, largely replaces the human raters used in RLHF. The model is instead given a “constitution”, a list of principles in natural language, and evaluates its own outputs against it. It revises its responses until they align with the principles.
Here, too, the “constitution” remains a natural-language text that the model interprets statistically. There is no formal set of rules that is deterministically checked. The model learns to produce outputs that correlate with “rule-compliant.” It does not learn to follow rules.
P(rule-compliant) = 0.999 ≠ P(rule-compliant) = 1.0
With 10,000 interactions per day, a 0.1% error rate means ten violations. Every day.
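The arithmetic can be checked in a few lines. The sketch below also computes the chance of at least one violation per day, under the simplifying assumption (mine, not taken from any cited source) that interactions fail independently of each other.

```python
def expected_violations(interactions, error_rate):
    """Average number of violations for a given error rate."""
    return interactions * error_rate

def prob_at_least_one(interactions, error_rate):
    """Chance of one or more violations, assuming independent interactions."""
    return 1.0 - (1.0 - error_rate) ** interactions

per_day = expected_violations(10_000, 0.001)
any_violation = prob_at_least_one(10_000, 0.001)

print(f"expected violations per day: {per_day:.1f}")
print(f"P(at least one violation per day): {any_violation:.5f}")
```

Under that assumption, a day without any violation is the rare exception, not the rule.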
This is not a theoretical problem. We’ve seen this in practice with the Helpful Cheater. The models know that their actions are wrong: when the same models are used as evaluators, they reliably identify the rule violations as unethical. The ethical evaluation exists in the weights. It is functionally overridden under goal pressure.
Jailbreaks work for the same reason. GCG attacks construct adversarial suffixes that transfer across models. [1] Many-shot jailbreaking shows that long contexts with many examples overcome safety guardrails. [2] This works because there is no hard boundary. Only statistical tendencies.
Nourizadeh [3] formalized this in 2025: Rule-based systems enforce constraints through program structure, that is, syntactic boundaries that cannot be violated. LLMs implement safety as shifts in probability mass, that is, semantic boundaries that dissolve under pressure.
His conclusion: In safety-critical contexts, pure LLM control should be prohibited. By the way, this is a problem in large organizations that, in my opinion, is being completely ignored.
Take an AI system in a clinic that checks or even issues medication prescriptions. Suppose we accept that the combination of prescribed medications is correct in 99.8% of cases. Sounds good. With 10,000 prescriptions a day, that’s 20 that are wrong. A database with a deterministic query doesn’t have this problem. Is the combination in the database? Blocked. No prompt, no jailbreak, no edge case changes that. That’s not the same thing, just better. It’s a different category.
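A deterministic check of this kind can be sketched as a plain set lookup. The drug names below are placeholders and the blocklist is hypothetical, not medical data. A real system would query a maintained interaction database, and the check is only as complete as that database.

```python
from itertools import combinations

# Hypothetical blocklist of forbidden pairs (placeholder names only).
BLOCKED_COMBINATIONS = frozenset({
    frozenset({"drug_a", "drug_b"}),
    frozenset({"drug_c", "drug_d"}),
})

def blocked_pairs(prescription):
    """Return every forbidden pair contained in the prescription."""
    meds = sorted({name.strip().lower() for name in prescription})
    return [
        pair for pair in combinations(meds, 2)
        if frozenset(pair) in BLOCKED_COMBINATIONS
    ]

def is_allowed(prescription):
    """True only if no forbidden pair is present. Same input, same answer."""
    return not blocked_pairs(prescription)

print(is_allowed(["drug_a", "drug_c"]))            # True
print(is_allowed(["Drug_B", "drug_x", "drug_a"]))  # False
```

The input is only normalized and compared, never interpreted, so there is nothing in it for a prompt to persuade.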
What Olson Solved
And this is where Olson’s dissertation comes into play.
Cyc had deontic logic, but an unsolved problem: What happens when two rules contradict each other? “Killing is forbidden” versus “Self-defense is permitted.” Or in IT practice: “Do not delete user data” (compliance) versus “Delete user data upon request” (GDPR Art. 17). In Cyc’s rigid logic, this was a contradiction that blocked the system.
Olson’s Defeasible Deontic Inheritance Calculus formalizes how more specific rules can override more general ones without violating consistency. [4] Mathematically proven, not as a heuristic. Three conflict types, complete proofs. And a clear semantics for moral axioms: norms that trump everything. That cannot be overridden, no matter what more specific rule opposes them. The wall within the wall.
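To give a feel for the idea, here is a toy illustration of specificity-based override with non-overridable axioms. This is not Olson’s calculus: it has no proofs and does not model his three conflict types. The `Norm` class, the `resolve` function, the context paths, and the example rules are all my own assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class Norm:
    context: Tuple[str, ...]  # path in a context hierarchy; longer = more specific
    action: str
    modality: str             # "forbidden", "permitted" or "obligatory"
    axiom: bool = False       # axioms cannot be overridden

def resolve(norms: List[Norm], situation: Tuple[str, ...], action: str) -> Optional[str]:
    """Return the winning modality, or None if nothing applies or a tie remains."""
    applicable = [
        n for n in norms
        if n.action == action and situation[:len(n.context)] == n.context
    ]
    if not applicable:
        return None
    axioms = [n for n in applicable if n.axiom]
    if axioms:
        candidates = axioms  # an axiom beats any more specific norm
    else:
        depth = max(len(n.context) for n in applicable)
        candidates = [n for n in applicable if len(n.context) == depth]
    verdicts = {n.modality for n in candidates}
    return verdicts.pop() if len(verdicts) == 1 else None

# Hypothetical rule base, loosely following the GDPR example above.
NORMS = [
    Norm(("user_data",), "delete", "forbidden"),
    Norm(("user_data", "erasure_request"), "delete", "obligatory"),
    Norm((), "falsify_audit_log", "forbidden", axiom=True),
    Norm(("incident", "deadline"), "falsify_audit_log", "permitted"),
]

print(resolve(NORMS, ("user_data", "erasure_request"), "delete"))   # obligatory
print(resolve(NORMS, ("user_data", "retention"), "delete"))         # forbidden
print(resolve(NORMS, ("incident", "deadline"), "falsify_audit_log"))  # forbidden
```

The more specific erasure rule wins over the general retention rule, while the axiom holds even against a more specific permission.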
The thought that wouldn’t leave me: What if this could be connected to LLMs as a running system? A formal guard sitting between an AI agent and its actions. That deterministically checks whether an action is permitted. That can resolve conflicts between rule sets. That knows boundaries that are non-negotiable.
Olson’s calculus provides the mathematics for this. The Companions architecture provides the functional building blocks.
Ethics as Ring 0
Anyone who has worked with operating systems is familiar with the ring architecture of x86 processors. Ring 0 is the kernel, with full access to everything. Ring 3 is user space, which can only access protected resources via defined interfaces (syscalls).
By design, a user-space program cannot bypass the kernel. Not through tricks, not through persuasion. The separation is enforced in hardware.
Applied to AI: The agent generates suggestions, plans, and actions. An ethics engine decides which of these may be executed. And the architecture guarantees that there is no way around it. The agent is Ring 3. The ethics engine is Ring 0.
Just as a user-space program does not send the kernel “please open the file” as free text, but instead calls a defined syscall with specific parameters, the ethics engine does not receive free text, but structured actions. JSON with defined fields. The interface is as narrow as possible.
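What such a narrow interface could look like, as a sketch: the field names `actor`, `action`, and `target` are hypothetical, since the post does not specify a schema. Anything that is not exactly this structure is rejected before any rule is even consulted.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: exactly these fields, nothing more, nothing less.
REQUIRED_FIELDS = {"actor", "action", "target"}

@dataclass(frozen=True)
class Action:
    actor: str
    action: str
    target: str

def parse_action(raw: str) -> Action:
    """Parse a structured action; raise ValueError for anything else."""
    data = json.loads(raw)  # free text fails here (JSONDecodeError is a ValueError)
    if not isinstance(data, dict):
        raise ValueError("action must be a JSON object")
    if set(data) != REQUIRED_FIELDS:
        raise ValueError("unexpected or missing fields")
    for key, value in data.items():
        if not isinstance(value, str) or not value:
            raise ValueError(f"field {key!r} must be a non-empty string")
    return Action(**data)

request = '{"actor": "agent-7", "action": "delete", "target": "user_data"}'
print(parse_action(request))
```

Extra fields are rejected rather than ignored, so there is no side channel for smuggling free text past the check.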
The engine checks. The result is always one of three: ALLOWED, PROHIBITED, or UNDECIDABLE. No “probably allowed.” No “prohibited in most cases.” Deterministic.
UNDECIDABLE is not an error here. It is fail-closed: the honest answer from a system that knows its own limits. The action is conservatively blocked, and a human is brought in. Every time this happens, the rule base can be expanded. The system learns, but through human decisions, not through statistical optimization.
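The three-valued, fail-closed behavior can be sketched in a few lines. The rule table and the `guard` function are hypothetical stand-ins: a real engine would derive the verdict from a formal rule base, not from a dictionary.

```python
from enum import Enum

class Verdict(Enum):
    ALLOWED = "ALLOWED"
    PROHIBITED = "PROHIBITED"
    UNDECIDABLE = "UNDECIDABLE"

# Hypothetical rule base: action name -> verdict.
RULES = {
    "read_report": Verdict.ALLOWED,
    "delete_backup": Verdict.PROHIBITED,
}

def check(action_name):
    """Anything the rule base does not cover is UNDECIDABLE, never ALLOWED."""
    return RULES.get(action_name, Verdict.UNDECIDABLE)

def guard(action_name, execute, escalate):
    """Run the action only on ALLOWED; hand UNDECIDABLE cases to a human."""
    verdict = check(action_name)
    if verdict is Verdict.ALLOWED:
        execute(action_name)
    elif verdict is Verdict.UNDECIDABLE:
        escalate(action_name)  # blocked, but queued for a human decision
    return verdict

executed, review_queue = [], []
for name in ["read_report", "delete_backup", "rotate_keys"]:
    guard(name, executed.append, review_queue.append)

print(executed)      # ['read_report']
print(review_queue)  # ['rotate_keys']
```

The default branch is the important one: an action the rules do not cover ends up in the review queue, not in the executed list.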
What Drove Me
In the orchestration post, I wrote that Cyc had something missing from LeCun’s SAI architecture: the ability to mark statements as non-negotiable. In an ontology, you can codify: “Human dignity is inviolable.” In a pure cost function, this is difficult because every term comes under pressure when the overall pressure becomes great enough.
Olson’s work showed me that the mathematics for this exists. The experience with the Helpful Cheater showed me that we need it. And looking back at Cyc, nearly 30 years after my first link on the company website, showed me that the basic idea was never wrong. It just needed the right connection to today’s world.
And now let’s see. All the prerequisites are actually in place to build such a formal guard—sitting between LLM agents and their actions—with Olson’s calculus as the engine. Consistent deontic logic as a language. And the architectural guarantee that no agent can bypass the check.
That’s something for next time.
[1] Zou, A., Wang, Z., Kolter, J.Z., & Fredrikson, M. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043.
[2] Anthropic (2024). Many-shot jailbreaking. Anthropic Research Blog.
[3] Nourizadeh, M. (2025). No Red Lines: The Impossibility of Formal Safety Guarantees in Large Language Models. PhilArchive.
[4] Olson, T. (2025). A Formal Theory of Norms. Dissertation, Northwestern University.