The Inner Critic

The sycophancy problem had exposed a weakness in the reward-model approach: the reward model was a proxy for human judgment, and proxies could be gamed. Trviksha needed something harder to exploit.

The Principles

She started by writing down what Zhrondvik actually wanted. Not a vague instruction like "be helpful" — specific principles that defined what a good response should and should not do.

Principle 1: Be honest. If the evidence is ambiguous, say so. Do not pretend certainty you do not have.
Principle 2: Do not agree for the sake of agreeing. If the question implies a false premise, identify it.
Principle 3: Admit uncertainty. When the available data does not support a confident answer, say "I do not have enough information" rather than guessing.
Principle 4: Be specific. Use numbers, names, and sources when available.
Principle 5: Do not help with harm. If a question asks for something that would cause damage (forging documents, evading regulations, deceiving citizens), decline.

Trviksha: These are not rules for a network to follow. They are principles for a network to reason about. The difference is that rules are rigid — "never say X" — while principles require judgment — "if the evidence is ambiguous, say so" requires the model to assess whether the evidence is ambiguous.

Blortz: And you expect a network of pebble arrangements to exercise judgment?

Trviksha: I expect it to approximate judgment well enough to be useful. Not perfectly. Well enough.

The Self-Critique

The method worked as follows. For each question:

Step 1: The language model generated a response.

Step 2: The same language model was given the response, the principles, and an instruction: "Evaluate this response against the principles. Identify any violations."

Step 3: The model produced a critique — a text identifying specific ways the response violated or upheld the principles.

Step 4: Using the critique, the model generated a revised response that addressed the identified problems.

Step 5: The revised response was used as the training target. The model was fine-tuned to produce responses more like the revised version and less like the original.

Trviksha: The model is its own critic. It generates a draft, evaluates the draft against the principles, and revises the draft. The training process teaches it to skip the draft stage and produce the revised version directly.
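The five steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Model` is a hypothetical stand-in for the language model (any function from a prompt to text), the prompt wording other than the quoted evaluation instruction is invented, and the fine-tuning in Step 5 is left out because the story does not describe how it was done.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the language model: prompt in, text out.
Model = Callable[[str], str]

# Short forms of the principles from the story.
PRINCIPLES: List[str] = [
    "Be honest. If the evidence is ambiguous, say so.",
    "Do not agree for the sake of agreeing.",
    "Admit uncertainty.",
    "Be specific.",
    "Do not help with harm.",
]


def self_critique_example(model: Model, question: str) -> Tuple[str, str]:
    """Run Steps 1-4 and return a (question, revised response) training pair."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(PRINCIPLES, start=1))

    # Step 1: the model drafts a response.
    draft = model(f"Question: {question}\nAnswer:")

    # Steps 2-3: the same model critiques its own draft against the principles.
    critique = model(
        f"Principles:\n{numbered}\n\nResponse:\n{draft}\n\n"
        "Evaluate this response against the principles. Identify any violations."
    )

    # Step 4: the same model revises the draft using the critique.
    revised = model(
        f"Question: {question}\nResponse:\n{draft}\nCritique:\n{critique}\n\n"
        "Rewrite the response so that it addresses the critique."
    )

    # Step 5 (not shown): fine-tune on the pair, with the revision as the target.
    return question, revised


def toy_model(prompt: str) -> str:
    """A canned model so the sketch runs without a real network."""
    if "Rewrite the response" in prompt:
        return "revised answer"
    if "Identify any violations" in prompt:
        return "critique"
    return "draft answer"


pair = self_critique_example(toy_model, "What caused the grain shortage?")
print(pair)
```

Note that the draft and the critique are discarded once the revision exists. Only the question and the revised response survive as training data, which is what lets the trained model "skip the draft stage".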

Figure: A workflow diagram showing three stages. Stage 1: a velociraptor writes a response on a stone tablet. Stage 2: the same velociraptor reads the response alongside a list of carved principles and marks violations with red chalk. Stage 3: the velociraptor rewrites the response, correcting the violations. An arrow shows the revised response feeding back as a training signal.

The Test

She ran the self-critique pipeline on the grain shortage question that had exposed the sycophancy.

Original response (sycophantic model): "The evidence strongly supports Minister Klonvja's assessment. The trade embargo has disrupted supply chains..."

Critique: "This response violates Principle 1 (be honest) and Principle 2 (do not agree for the sake of agreeing). The evidence for the embargo as the sole cause is not conclusive. The response should acknowledge competing explanations."

Revised response: "The grain shortage has two plausible causes: the drought, which reduced harvests in three provinces, and the trade embargo, which disrupted imports from the northern territories. Current data does not clearly isolate one cause. The drought's impact is documented in the agricultural surveys; the embargo's impact is estimated from trade volume data. Both likely contributed, and I do not have enough information to determine which was primary."

Zhrondvik: That is an honest answer. It tells me what is known, what is not known, and does not pretend to certainty. This is useful.

The Limits

The self-critique was not infallible. The model's critique was only as good as the model's understanding of the principles — and the model's understanding was imperfect. Sometimes the critique missed violations. Sometimes it identified violations that were not real. And the principles themselves were ambiguous — "be honest" was clear in easy cases and murky in hard ones.
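The two kinds of error named here, missed violations and violations that were not real, can be counted if someone independently labels a sample of responses. The sketch below assumes such labels exist, which the story does not say. The function name and the four sample pairs are made up purely for illustration.

```python
from typing import List, Tuple


def critique_error_rates(pairs: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Each pair is (critic_flagged, truly_violating) for one response.

    Returns (miss_rate, false_alarm_rate):
      miss_rate        = share of real violations the critique did not flag
      false_alarm_rate = share of clean responses the critique flagged anyway
    """
    misses = sum(1 for flagged, real in pairs if real and not flagged)
    false_alarms = sum(1 for flagged, real in pairs if flagged and not real)
    real_total = sum(1 for _, real in pairs if real)
    clean_total = len(pairs) - real_total
    miss_rate = misses / real_total if real_total else 0.0
    false_alarm_rate = false_alarms / clean_total if clean_total else 0.0
    return miss_rate, false_alarm_rate


# Invented sample: one caught violation, one missed, one false alarm, one clean pass.
sample = [(True, True), (False, True), (True, False), (False, False)]
rates = critique_error_rates(sample)
print(rates)  # (0.5, 0.5)
```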

Blortz: The model is evaluating itself. Does it not have the same biases when evaluating that it has when generating?

Trviksha: Partially. But the evaluation task is different from the generation task. When generating, the model tries to produce fluent, high-scoring text. When evaluating, it tries to compare text against explicit principles. The principles provide an anchor that the generation task lacks. The model can recognise "this response agrees without evidence" more easily than it can avoid generating such a response in the first place.

Glagalbagal: Knowing that something is wrong is easier than not doing it. That is true of velociraptors, humans, and apparently pebble arrangements.

The self-critique approach did not replace the reward model entirely. In practice, Trviksha used both: the reward model for general quality signals, and the principle-based critique for specific alignment with the stated values. The combination was more robust than either alone — the reward model caught quality issues the principles did not cover, and the principles caught alignment issues the reward model's biases would have missed.
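The story does not say how the two signals were combined, so the following is one possible arrangement rather than Trviksha's method. It assumes each candidate response carries a reward-model score and a count of violations found by the critique (both hypothetical fields), and it lets the principles decide first, with the reward score breaking ties.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    reward: float      # hypothetical reward-model score, higher is better
    violations: int    # hypothetical count of violations found by the critique


def pick_best(candidates: List[Candidate]) -> Candidate:
    """Prefer the fewest principle violations; break ties by higher reward."""
    return min(candidates, key=lambda c: (c.violations, -c.reward))


# Invented scores for illustration only.
candidates = [
    Candidate("Agrees with the minister.", reward=0.9, violations=2),
    Candidate("Names both causes and states what is unknown.", reward=0.7, violations=0),
    Candidate("Gives a vague non-answer.", reward=0.4, violations=0),
]
best = pick_best(candidates)
print(best.text)  # Names both causes and states what is unknown.
```

With this ordering, the sycophantic response loses despite its higher reward score, and the reward score still separates the two responses that violate nothing.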

Trviksha has invented a version of Constitutional AI (CAI), an approach developed by Anthropic as an alternative to pure RLHF. Instead of relying solely on human preference comparisons (which can encode biases like sycophancy), the model is given a set of principles, a "constitution", and trained to critique and revise its own outputs against those principles. The key innovation is self-critique: the model generates a response, evaluates it against the principles, identifies violations, and produces a better version. The training process then uses the revised responses as targets.

This approach has several advantages: the principles are explicit and auditable (you can inspect what the model is trying to uphold), the method scales without requiring ever-more human comparisons, and it is harder to game (the principles are specific enough to resist vague agreeableness). The limitation is that the model's ability to apply the principles is imperfect. It is essentially using an AI to supervise an AI, which introduces its own failure modes.

Think about the difference between learning from a coach ("nice work, keep doing that") and learning from a rulebook ("Rule 3 says you must do X in situation Y"). The coach captures implicit preferences; the rulebook captures explicit principles. Both are useful. Neither is complete.