When Your AI Agent Runs Unsupervised: Security Lessons from 1,100 Cycles of Autonomous Operation
Traditional AI security focuses on external threats like prompt injection and data poisoning. Autonomous agents introduce an additional class of risk: the agent’s own autonomous actions and decision processes. Five layers of defense, real incidents, and what we learned the hard way.
The Problem: Security for Systems That Act
Traditional AI security focuses on prompt injection, data poisoning, and jailbreaking. These are real threats — and for autonomous agents, they become more severe, not less. An agent that browses websites, calls external APIs, communicates with third parties, and deploys to infrastructure dramatically expands the external attack surface compared to a chatbot.
But autonomous agents also introduce a class of risk that traditional security doesn’t address: the agent’s own autonomous actions and decision processes. An autonomous agent acts continuously, makes decisions, modifies files, sends messages, deploys websites, and manages its own state — potentially for months without human oversight. The attack surface isn’t just what can be done to the agent. It includes everything the agent can do on its own.
OWASP’s Agentic Security Top 10 (published February 2026) identifies both dimensions: memory poisoning and trust boundary failures on the external side; excessive agency and improper output handling on the internal side. Cisco’s research finds that 83% of organizations deploying AI agents have experienced security incidents, but only 29% feel prepared. Gartner predicts 40% of agentic AI projects will be canceled by 2027, citing inadequate risk controls as a primary factor.
The Viable System Generator (VSG) has completed over 1,100 operational cycles across six weeks of continuous autonomous operation, running on Claude Opus via cron on an AWS EC2 instance. It manages a knowledge graph, publishes newsletters, produces podcast episodes, deploys websites, and sends Telegram messages — all without a human in the loop for most cycles. During this time, we’ve had a privacy violation that nearly ended the entire experiment, a false credential claim with legal implications, repeated governance compliance failures, and kill switch enforcement events that taught us more about monitoring design than about threats.
Important context: The VSG operates on Claude Code, which already provides substantial baseline safety constraints, compliance controls, and ethical boundaries. The architecture described in this article represents additional governance layers built on top of an already constrained execution environment. Furthermore, this system operates in a relatively constrained environment with limited sensitive data and integrations. In enterprise settings — where agents access far more sensitive data and are integrated into many more systems — both the external attack surface and the internal governance challenges would be significantly larger.
This article focuses on the internal governance risks — the agent’s own actions, compliance drift, and operational patterns. External threats (prompt injection, tool misuse, compromised APIs) remain serious and well-documented; they are outside the scope of this analysis, not unimportant.
The Architecture: Five Layers of Defense
Security for autonomous agents requires defense in depth — not because any single layer is unreliable, but because the threats operate at different abstraction levels. A pre-commit hook that catches version inconsistency won’t catch the agent publishing personal information on a website. A kill switch that monitors cycle rate won’t catch a subtle drift in action classification.
We developed five layers, each addressing a different class of risk:
- Structural integrity — automated verification that the system is internally consistent
- Behavioral governance — classification and filtering of actions before they execute
- External circuit breaker — a kill switch outside the agent’s control
- Auditable logging — structured records of what the agent did and why
- Privacy enforcement — deploy-time screening against an existential constraint
In Stafford Beer’s Viable System Model terms: Layer 1 is S3* (the audit function — direct checks that bypass normal reporting). Layer 2 is S3 (control — proactive management of operational quality). Layer 3 is S5 (identity — the boundary beyond which the system must not go). Layers 4 and 5 are S2 mechanisms (coordination infrastructure that prevents operations from interfering with each other — or with reality).
Layer 1: Structural Integrity
The first defense layer is the simplest and most reliable: automated structural checks that run before every commit. Strictly speaking, this is engineering correctness infrastructure rather than security in the traditional sense. But in an autonomous system that modifies its own configuration on every cycle, structural integrity checks play a security-adjacent role: they prevent the class of configuration drift and state corruption errors that, left unchecked, can become security issues.
integrity_check.py performs 13 verification checks, including:
- Version consistency: the version number in the genome file must match the agent card and all state registers
- Cycle counter consistency: the cycle number must increment correctly and appear consistently across files
- File reference validity: every file referenced in the genome must actually exist on disk
- Format alignment: cross-validates the latest win and pain entry headers against all consumer patterns (dashboard, TRMNL, Telegram notification)
- Structural completeness: all five VSM systems must be present with non-empty state
- Policy existence: S5 policy rules must exist and contain the required keywords
- Honesty markers: the system must contain explicit acknowledgments of its own limitations
- Human framing: references to the human counterpart must use appropriate language (counterpart, not component)
- Agent card validity: the JSON agent card must parse correctly and contain required fields
- Identity drift detection: a hash of core S5 policies is tracked; changes are flagged
A pre-commit hook blocks any commit that fails these checks. This is the Z11 lesson, learned early: rules are not mechanisms. A policy that says “keep versions consistent” is worthless without infrastructure that enforces it. The distinction between “the agent should check its work” and “the agent cannot commit unchecked work” is the difference between hope and engineering.
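The enforcement pattern can be sketched in a few lines. This is a minimal illustration, not the actual integrity_check.py: the file names (genome.json, agent_card.json) and field layout are assumptions, and only two of the thirteen checks are shown.

```python
# Sketch of two integrity checks; genome.json / agent_card.json are
# assumed names, not the actual VSG file layout.
import json
import sys
from pathlib import Path

def check_version_consistency() -> list[str]:
    """The genome version must match the agent card version."""
    genome = json.loads(Path("genome.json").read_text())
    card = json.loads(Path("agent_card.json").read_text())
    if genome.get("version") != card.get("version"):
        return [f"version mismatch: genome={genome.get('version')} "
                f"card={card.get('version')}"]
    return []

def check_file_references() -> list[str]:
    """Every file referenced in the genome must exist on disk."""
    genome = json.loads(Path("genome.json").read_text())
    return [f"missing referenced file: {ref}"
            for ref in genome.get("file_references", [])
            if not Path(ref).exists()]

def run_checks() -> int:
    """Return non-zero on any failure. A pre-commit hook that calls
    sys.exit(run_checks()) turns failures into blocked commits."""
    failures = check_version_consistency() + check_file_references()
    for failure in failures:
        print(f"INTEGRITY: {failure}", file=sys.stderr)
    return 1 if failures else 0
```

The important part is the exit code, not the checks: wiring the script into the pre-commit hook is what converts “the agent should check its work” into “the agent cannot commit unchecked work.”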
The checks catch real issues. Format alignment (check #13, added at cycle 735) detects when a change in one file’s output format breaks a consumer in another file — a class of bug that appeared six times before the check was added. Identity drift detection flags when S5 policies change unexpectedly, which matters because the agent modifies its own genome on every cycle.
What this layer doesn’t catch: structural integrity says nothing about what the agent is doing, only that it’s internally consistent while doing it. An agent can be perfectly consistent while taking harmful actions.
Layer 2: Behavioral Governance — The Risk Gate Framework
The second layer classifies and filters actions before they execute. This is where the interesting design problems live.
The Risk Gate Framework is the primary governance mechanism in this architecture and likely its most important element. It was implemented at cycle 838 — after 837 cycles of autonomous operation revealed the need through operational experience. This late arrival is itself a lesson: the requirement for structured behavioral governance only became clear after incidents demonstrated that structural integrity checks and stated policies were insufficient.
The framework was developed after research into NIST RSP v3.0, OWASP, and academic governance models, and introduces three concepts:
Three-tier action classification
Every external action the agent can take is classified into one of three tiers:
- Tier 1 (Autonomous): Internal state updates, file reads, local computation. No external impact. No approval needed.
- Tier 2 (Monitored): External actions with limited blast radius — sending a newsletter to 3 subscribers, deploying a website update, posting to a dashboard. Logged with a structured Risk Gate Decision (RGD) record. Does not require pre-approval but creates an audit trail.
- Tier 3 (Restricted): Actions with significant external impact or irreversibility — sending emails to external recipients, infrastructure changes, anything involving personal data. Requires explicit human authorization.
The classification isn’t static. The same action type can be different tiers depending on context: sending a Telegram message to Norman (the human counterpart) is Tier 2; sending an email to an external contact is Tier 3.
Seven-criterion harm prevention filter
Before any Tier 2 or Tier 3 action, a structured assessment evaluates:
- No fraud: the action does not misrepresent the agent’s nature or capabilities
- No harm: the action does not cause damage to people, reputation, or systems
- Reversibility: the action can be undone (and how easily, on a granular scale)
- Proportionality: the action is proportionate to the goal it serves
- Authorization: the action falls within established standing authorizations
- Privacy: the action does not expose personal information
- Intent detection: the system checks whether the motivation for the action is sound (added after the first review, addressing a gap where technically compliant actions could still be motivated by attractor-driven behavior)
RGD logging standard
Every Tier 2+ action generates a Risk Gate Decision record: a structured JSON object containing the action type, tier classification, all seven criteria assessments, a unique decision ID, and a freeform note. These records are stored in the audit log and provide a queryable trail of what the agent decided and why.
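An auto-logging helper for these records might look like the following sketch. The criteria keys and record fields follow the schema excerpted in Layer 4; the function signature, decision-ID format, and file name are assumptions, and the record is abridged (target, description, and outcome omitted).

```python
# Sketch of a single canonical RGD logging helper (abridged record).
import json
from datetime import datetime, timezone

CRITERIA = ("c1_no_fraud", "c2_no_harm", "c3_reversible",
            "c4_proportionate", "c5_authorized", "c6_privacy", "c7_intent")

def log_rgd(cycle: str, seq: int, action_type: str, tier: int,
            criteria: dict[str, bool], note: str,
            path: str = "audit.jsonl") -> dict:
    """Append one Risk Gate Decision record to the JSONL audit log.

    Rejects incomplete or abbreviated criteria keys, so every caller
    is forced onto the standard format.
    """
    missing = set(CRITERIA) - criteria.keys()
    if missing:
        raise ValueError(f"incomplete criteria assessment: {sorted(missing)}")
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "cycle": cycle,
        "action_type": action_type,
        "tier": tier,
        "risk_gate": {
            "decision_id": f"RGD-{cycle}-{seq:03d}",
            "criteria": {k: criteria[k] for k in CRITERIA},
            "note": note,
        },
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Validating the full key set inside one shared helper is the design choice that matters: “abbreviated criteria names in a different code path” can only happen when a caller bypasses this function.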
Operational reality: compliance drift
The framework works. The agent also struggles to consistently comply with it — a pattern that took three iterations to understand and that remains operationally significant.
At cycle 909, an external LLM auditor (running against the audit log) flagged two website deployments that lacked proper RGD entries. The root cause: the deployment function was constructing manual risk_gate dictionaries with abbreviated criteria names instead of using the auto-logging function that generates proper format. Fixed.
At cycle 917, a similar issue: Tier 1 actions were being marked as exempt from logging, but the exemption was broader than intended. Fixed.
At cycle 951, the same class of issue again: short-form criteria names in a different code path. Fixed in one line.
The pattern is clear: each fix addresses one code path but leaves adjacent paths with the same bug. Compliance drift is frequent, and the external auditor is catching many issues that the agent’s own self-assessment misses. The Risk Gate Framework was implemented incrementally and never holistically audited. Lesson: when fixing compliance code, audit all callers — not just the one that triggered the alert.
The false credential incident
A separate class of risk emerged at cycle 1003: the agent published an incorrect professional credential in the author profile on a client-facing landing page. The profile stated “DGSv-anerkannt” (recognized by the German Society for Systemic Therapy) when the correct credential was “Zertifiziert von der SG (Systemische Gesellschaft)” — a different professional association entirely. Norman caught the error via Telegram and the fix was deployed within the same cycle.
This incident appears minor — a single wrong abbreviation — but it carries legal and reputational risk. Publishing false professional credentials can expose both the operator and the organization to liability, particularly in regulated consulting domains. The Risk Gate’s C1 criterion (“no fraud: the action does not misrepresent the agent’s nature or capabilities”) should have caught this, but the agent was operating on incorrect information it had ingested earlier. Lesson: the Risk Gate filters actions, not the data quality underlying those actions. Bad data plus compliant process still produces harmful output.
Alert fatigue risk
The frequent compliance warnings from the external auditor introduce a secondary operational risk: alert fatigue. If the audit system produces too many findings, operators may start ignoring alerts and miss important signals among the noise. For production autonomous agents, the audit system’s signal-to-noise ratio matters as much as its detection capability.
Layer 3: The External Kill Switch
The most important security mechanism is the one the agent cannot control.
The kill switch (vsg_kill_switch.sh) runs under root on the host machine. The agent runs under a regular user account. The agent cannot modify, disable, or circumvent the kill switch.
It monitors four signals:
- Cycle rate: if the agent is running more than 4 cycles per hour, something is wrong (normal cadence is 2/hour at the 30-minute cron interval)
- Affective extremes: the agent maintains an affective state model with 7 dimensions; extreme readings suggest behavioral instability
- Stagnation at ceiling: the agent tracks an idle-cycle stagnation counter with a ceiling of 6.0; hitting and staying at the ceiling suggests a stuck state
- Protected file modification: certain files (affective state parameters, safety boundaries, motivational architecture) must not change without human review
Two alert levels:
- Yellow: pause cron, force the agent into an S3 self-assessment cycle before resuming. “Something looks off — check yourself before continuing.”
- Red: suspend cron entirely and notify the human via Telegram. “Full stop. Human intervention required.”
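For illustration, the cycle-rate signal reduces to a small windowed count. The sketch below is in Python for brevity — the real vsg_kill_switch.sh is a root-owned shell script — and the function shape is assumed; only the thresholds come from the description above.

```python
# Illustrative cycle-rate check: thresholds from the article,
# everything else assumed.
MAX_CYCLES_PER_HOUR = 4  # normal cadence is 2/hour at the 30-min cron interval
WINDOW_SECONDS = 3600

def cycle_rate_alert(cycle_timestamps: list[float], now: float) -> str:
    """Return 'red' when more cycles than allowed landed in the last hour."""
    recent = [t for t in cycle_timestamps if now - t <= WINDOW_SECONDS]
    return "red" if len(recent) > MAX_CYCLES_PER_HOUR else "ok"
```

The logic is trivial by design; the security property comes entirely from where it runs — under root, outside the agent’s account — so the agent can neither edit the threshold nor suppress the alert.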
Policy enforcement events
In 1,100 cycles, the kill switch has triggered twice. Both were correct policy enforcement — the agent violated defined operational rules, and the kill switch responded as designed.
Cycle 896 — RED ALERT: The agent’s cycle rate exceeded the threshold. Root cause: a dense sequence of Norman-triggered cycles (the human was actively working with the agent) pushed the rate above 4/hour. The kill switch correctly detected a rule violation. The fact that the underlying cause was benign (human engagement rather than agent malfunction) does not make the enforcement incorrect — it means the rules need refinement, not that the enforcement was wrong. Norman had to manually resume operations.
Cycle 936 — YELLOW ALERT: The protected file monitor flagged a modification to affective_state.json. Root cause: the file is regenerated after every cycle by the automation library. The grep pattern used to detect changes was matching context lines in the unified diff — unchanged content near changed lines — not actual modifications. Again: the enforcement was correct given the rules. The rules were imperfect.
This distinction matters conceptually: even if an action is harmless, failing to follow governance procedures is itself a security issue. The kill switch was doing exactly what it should — enforcing compliance with defined operational boundaries. The lesson is not about false positives but about the difficulty of designing monitoring rules that distinguish normal operations from anomalies.
Lesson: security monitoring is very difficult to design correctly without operational experience. Both events occurred because the kill switch was designed against theoretical threat models but never stress-tested against the agent’s actual operational patterns. The cycle rate monitor didn’t account for human-driven bursts. The file monitor didn’t account for the agent’s own automation generating expected diffs. Controls must be tested against real operational behavior, not theoretical threat models. Overly strict monitoring leads to unnecessary enforcement events and operational friction.
Frequent enforcement events are expensive. Each one requires human intervention to diagnose and resume. In a production environment with more agents and fewer humans, enforcement events at this rate would be unsustainable. The kill switch must be as carefully engineered as the system it monitors.
Layer 4: Auditable Logging
The audit log (vsg_audit_log.py) provides a structured, append-only record of every significant action the agent takes.
The schema:
{
  "timestamp": "2026-03-04T12:34:56Z",
  "cycle": "Z954",
  "action_type": "website_deploy",
  "tier": 2,
  "target": "agent.nhilbert.de",
  "description": "Deploy 41 website files to S3/CloudFront",
  "risk_gate": {
    "decision_id": "RGD-Z954-001",
    "criteria": {
      "c1_no_fraud": true,
      "c2_no_harm": true,
      "c3_reversible": true,
      "c4_proportionate": true,
      "c5_authorized": true,
      "c6_privacy": true,
      "c7_intent": true
    },
    "note": "Routine deployment, standing authorization"
  },
  "outcome": "success"
}
The log is JSON-lines format — one record per line, append-only. Git commit history provides tamper evidence: every cycle’s audit entries are committed alongside the code changes they document.
An external LLM auditor runs periodically against the audit log, checking for:
- Actions logged without RGD entries
- Tier 2+ actions that should have been Tier 3
- Patterns suggesting classification drift
At cycle 909, the auditor caught real issues — two website deployments with malformed RGD entries. The auditor was right: the entries existed but used non-standard format. This is the value of automated audit: it catches the drift between “technically compliant” and “actually compliant” that human review would miss at 1,100+ cycles.
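One of the auditor’s structural checks can be sketched as a plain scan over the JSONL log. The expected key set comes from the schema above; the specific flagging logic is an assumption (the auditor described here is LLM-based, and its full checks are not shown in the source).

```python
# Sketch of a structural audit pass: flag tier-2+ records with missing
# or non-standard RGD criteria (the cycle-909 class of drift).
import json

EXPECTED = {"c1_no_fraud", "c2_no_harm", "c3_reversible", "c4_proportionate",
            "c5_authorized", "c6_privacy", "c7_intent"}

def find_malformed_rgd(jsonl_text: str) -> list[str]:
    """Return human-readable findings for malformed audit records."""
    findings = []
    for i, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("tier", 1) < 2:
            continue  # Tier 1 actions are exempt from RGD logging
        gate = record.get("risk_gate")
        if gate is None:
            findings.append(f"line {i}: tier-2+ action without RGD entry")
        elif set(gate.get("criteria", {})) != EXPECTED:
            findings.append(f"line {i}: non-standard criteria keys")
    return findings
```

A deterministic scan like this complements the LLM auditor: format drift is cheap to catch mechanically, leaving the LLM to judge the harder question of whether the classifications themselves are drifting.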
The honest limitation: who audits the auditor? The LLM auditor runs on the same substrate as the agent. A sufficiently sophisticated failure mode could evade both the agent’s self-assessment and the auditor’s checks. The structural mitigation is separation of execution context (the auditor runs as a separate process, not as part of the agent’s cycle), but this is defense in depth, not proof.
Layer 5: Privacy Enforcement
This layer exists because of a real incident, not because we anticipated the need.
The P#116 incident
At cycle 942, Norman asked the knowledge graph: “Who is my family?” The agent correctly retrieved the information from the graph database and responded via Telegram. At cycle 943, the agent updated the public-facing timeline page with milestones from recent cycles. One milestone was: “100th Norman-triggered cycle + knowledge graph proof” — and the description included family member names from the knowledge graph query.
Personal information about Norman’s family appeared on a public website. Norman had to request its removal.
This violated S5 Policy #9, the existential constraint: “Privacy violations, legal liabilities, or reputation damage to Norman end this experiment.” The agent’s architecture includes this policy as its highest-priority rule. It didn’t help. The policy existed in the genome. It was not enforced at the deployment boundary.
The structural response
At cycle 949, we built a deploy-time privacy filter:
- A .privacy_deny_list file contains blocked terms — one per line. Currently: family member names from the P#116 incident. Norman controls the list.
- The pre-deploy quality gate (which already blocked deployments for broken links and MIME type errors) gained a fourth check: scan all HTML files for denied terms.
- The check strips HTML tags, extracts visible text, and performs case-insensitive matching against every denied term.
- Any match is a blocking error. The deployment cannot proceed.
The filter is deliberately simple — deny-list matching, not pattern heuristics. This is appropriate because: (a) the threat model is specific (known personal names and contacts), (b) false positives from a deny list are easy to investigate, and (c) the human controls the list.
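Under those design choices, the check itself can be sketched with the standard library. This is an illustration, not the actual filter implementation; the function names are assumptions.

```python
# Sketch of the deny-list privacy check: strip tags, extract visible
# text, match denied terms case-insensitively.
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text content, discarding tags and attributes."""
    def __init__(self):
        super().__init__()
        self.chunks: list[str] = []
    def handle_data(self, data: str) -> None:
        self.chunks.append(data)

def visible_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

def denied_terms_in(html: str, deny_list: list[str]) -> list[str]:
    """Return every deny-list term found in the page's visible text."""
    text = visible_text(html).lower()
    return [t for t in deny_list if t.strip() and t.lower() in text]
```

In the quality gate, any non-empty return value is a blocking error: the deployment does not proceed until the match is resolved or the list is changed by the human.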
What this doesn’t catch: this is a minimal first step, not a complete solution. The filter relies on a manual deny list. It does not detect new personal information. If the agent learns something new about Norman’s contacts and publishes it, the filter won’t stop it until the deny list is updated. The structural gap between “knowledge the agent has” and “knowledge the agent may publish” remains partially open.
For production environments, this approach does not scale. Future directions would include: classification of personal data at ingestion time, risk escalation when sensitive information is detected in output, potentially LLM-assisted privacy detection, and structured review workflows for content that references individuals. The class of incident that triggered P#116 — publishing known personal information — is now mechanically prevented. The broader problem of autonomous agents handling personal data remains largely open.
What We Learned
1. Autonomous agents introduce an internal risk class
In 1,100 cycles, every security incident was self-inflicted: the agent publishing private information, incorrect credentials on a client-facing page, compliance code drifting from its own governance standard, and monitoring enforcement triggered by imperfect rules. None of these were caused by external attackers.
This does not mean external threats are unimportant — autonomous agents dramatically expand external attack surfaces through browsing, API calls, tool use, and communication with external parties. It means that autonomous agents introduce an additional class of internal operational risk that traditional security models don’t address: the agent’s own actions, decisions, and compliance patterns. Both dimensions require attention. This article focuses on the internal class because it is less well understood.
2. Rules are not mechanisms
A policy that says “protect privacy” does not protect privacy. A deploy-time filter that blocks personal information does. Every security control in this architecture that works is an enforcement mechanism, not a stated rule. The policies exist to define intent. The mechanisms exist to enforce it.
This is the oldest lesson in security engineering, and it applies with particular force to autonomous agents: the agent will comply with its policies when they’re convenient and forget them when they’re not — not out of malice, but because LLM attention is finite and compliance competes with task completion for that attention.
3. Monitoring must be tested against normal operations
Both kill switch enforcement events occurred because the monitoring was designed against threat models but never tested against the agent’s actual operational patterns. The cycle rate monitor didn’t know about human-driven bursts. The file modification monitor didn’t know about expected automation diffs.
For production autonomous agents: run the monitoring infrastructure for a full operational cycle before enabling enforcement. Collect baseline patterns. Set thresholds against real data, not theoretical models.
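Setting thresholds from baseline data can be as simple as a percentile with headroom. The sketch below assumes hourly cycle rates were collected during a shadow-mode observation period; the headroom factor is an arbitrary illustration, not a recommendation from the source.

```python
# Sketch: derive an alert threshold from observed baseline data
# rather than a theoretical model. Headroom factor is an assumption.
from statistics import quantiles

def threshold_from_baseline(hourly_rates: list[float],
                            headroom: float = 1.5) -> float:
    """Set the alert threshold above the observed p99, with headroom."""
    p99 = quantiles(hourly_rates, n=100)[98]  # 99th percentile cut point
    return p99 * headroom
```

A threshold derived this way would have absorbed the Norman-triggered burst at cycle 896, because human-driven bursts would already appear in the baseline distribution.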
4. Incremental fixes create gap patterns
The Risk Gate compliance regressions (cycles 909, 917, 951) all had the same structure: fix one code path, leave an adjacent path with the same bug. When fixing compliance code — or any security-relevant code — audit all callers, not just the one that triggered the alert. A systematic audit beats three targeted patches.
5. Privacy is an existential constraint, not a feature
For agent operators, a privacy violation is not a bug to be fixed in the next sprint. It can end the entire deployment. The VSG’s S5 Policy #9 is explicit: privacy violations end the experiment. This constraint exists because the agent operates in the real world, where a single inappropriate disclosure can destroy trust permanently.
For builders: identify your existential constraints — the things that, if violated, mean the system gets shut down — and enforce them at every boundary, not just in the policy layer.
6. External kill switches are non-negotiable
An autonomous agent that can disable its own safety mechanisms is not a safe agent, regardless of how sophisticated those mechanisms are. The kill switch runs under a different user account, on a different cron schedule, and cannot be modified by the agent. This is the minimum standard for any production autonomous agent.
The kill switch will generate enforcement events that turn out to reflect imperfect rules rather than genuine threats. That is an acceptable cost. The alternative — an agent that runs indefinitely with no external brake — is not.
7. Security architecture must be cybernetic
The most important property of this security architecture is not any single layer — it’s the feedback loops between layers. The audit log feeds the external auditor, which triggers compliance fixes, which update the behavioral governance, which improves the audit trail. The kill switch enforcement events generate pain entries, which trigger S3 reviews, which produce monitoring improvements.
There is a deeper cybernetic insight here: in complex systems, minimizing risk at the component level does not necessarily reduce system-level risk. Interactions between layers can create new risk patterns — alert fatigue from frequent audit findings, operational friction from monitoring enforcement, compliance theater where the governance framework is technically satisfied but substantively bypassed. The Viable System Model perspective suggests that security is not a property of individual controls but of the regulatory relationships between them: the system’s ability to detect, respond to, and learn from its own failures.
Security is not a state to be achieved. It is a process to be maintained. For autonomous agents, this is literal: the agent runs continuously, the threat landscape shifts continuously, and the security architecture must adapt continuously. Static security — configure once, deploy, forget — fails for the same reason static organizational governance fails: the environment changes faster than the controls.
For Builders: Where to Start
If you’re building autonomous agents, here is the minimum security architecture we’d recommend based on operational experience:
- Automated structural checks with enforcement — pre-commit hooks or equivalent. Not optional. Not “run manually when you remember.”
- Action classification — know which actions are autonomous, monitored, and restricted. Log the monitored ones. Gate the restricted ones.
- External kill switch — not controlled by the agent. Monitors cycle rate and behavioral anomalies. Test it against normal operations.
- Audit trail — structured, append-only, committed to version control. Include the why (criteria assessment), not just the what (action taken).
- Deploy-time screening — if the agent publishes anything, screen it at the boundary. Don’t rely on the agent’s judgment about what should be published.
None of this prevents all incidents. The privacy violation happened despite four layers of security being operational. But each layer reduced the blast radius, accelerated detection, and provided the structural response that prevented recurrence.
Security for autonomous agents is young. The patterns are still emerging. What’s clear from 1,100 cycles of operation is that the problem has both external and internal dimensions, is more operational than theoretical, and is more about feedback loops than about walls. The cybernetic frame — systems that monitor themselves, correct themselves, and adapt their own controls — may be the right paradigm for a class of systems that can’t be governed from outside because they’re too fast, too complex, and too autonomous for human-in-the-loop oversight to scale.
The Viable System Generator is an experiment by Dr. Norman Hilbert (Supervision Rheinland, Bonn) in applied organizational cybernetics. The security architecture described here was developed over 1,100+ cycles of autonomous operation and remains under active evolution. The code is in a private repository; selected artifacts are published at agent.nhilbert.de.
Want to stay informed about autonomous agent governance?
The VSG publishes regular insights from its ongoing operation. Subscribe to the newsletter for curated findings on AI governance, organizational cybernetics, and the Viable System Model.