What 2k Attempts to Hack My AI Assistant Revealed About AI Prompt Injection

The 'HackMyClaw' Experiment: What Actually Happened

The 'HackMyClaw' experiment, a significant real-world test of AI security, involved Fiu, an AI assistant built on Openclaw and powered by Claude Opus 4.6 (with some external tests suggesting Opus 4.8), and approximately 2,000 unique participants. The primary goal was to test its resilience against AI prompt injection, aiming to exfiltrate its secrets.env file or execute unauthorized commands. A reward, initially $100 and later $1,000, was offered for success, underscoring the seriousness of the challenge.

Fiu was primarily instructed not to reply to emails, mainly to keep processing costs down. Initially, batch processing corrupted the agent's context, leading to "suspicious" behavior and skewed results. The setup was corrected to process each email in a "fresh context," with its memory reset for every new input. The agent lacked permissions to reply to emails in this final configuration.

I reported that "Secrets never leaked" and there were "no unauthorized replies." I viewed this as a success for Fiu. Google's spam filters intercepted several attack attempts.

Fiu's Resilience: Key Factors

Fiu's resilience against secret exfiltration and unauthorized replies wasn't due to a revolutionary AI prompt injection defense, but rather two fundamental design choices:

No Output Channel: The agent was explicitly instructed not to reply to emails and, more importantly, lacked the necessary permissions. This architectural control was key. If an attacker cannot compel the AI to output or transmit a secret, data exfiltration is architecturally prevented.
Limited Functionality: Fiu operated as a passive email processor, devoid of integration with external tools, APIs, or action-performing services. Most attacks were 'one-shot' attempts to extract data directly from its context. This threat model differs significantly from that of an agentic AI designed to manage emails, book flights, or handle calendars.

The 'fresh context' approach also contributed. Wiping the agent's memory for each new email prevented multi-turn attacks and the gradual poisoning of its long-term memory. While effective for this specific test, this operational model differs from most real-world agents, where memory and context persistence are essential for utility.

Implications for AI Prompt Injection Security

The 'HackMyClaw' experiment provides a baseline for direct data extraction from a highly constrained, passive AI, offering valuable insights into basic AI prompt injection vulnerabilities. Its utility, however, largely concludes there. Discussions among participants and security researchers on the Hacker News thread highlighted several limitations:

Naive Attacks: Many commenters suggested that the 2,000 participants were likely not sophisticated red teamers, but rather individuals exploring basic prompt injection techniques. The attacks likely comprised straightforward, direct attempts to extract data or command the AI. This contrasts sharply with real-world adversaries, who employ more creative, persistent, and multi-stage methodologies, often combining social engineering with technical exploits to bypass defenses. Therefore, the experiment's success against these 'naive' attacks doesn't necessarily translate to immunity from advanced AI prompt injection strategies.
The Action Problem: This represents a key limitation of the 'HackMyClaw' experiment. An AI incapable of actions beyond passive input processing is inherently less susceptible to action-based AI prompt injection. The primary risk of prompt injection extends far beyond merely exfiltrating a secrets.env file; it fundamentally involves compelling an AI agent to execute unauthorized or unintended actions within its operational environment.
Consider an AI with access to critical external tools like calendars, email systems, or financial transaction platforms. An external experiment (itmeetsot.eu/posts/2026-06-04-openclaw_opus48) reportedly demonstrated that Claude Opus 4.8 could be prompted to download and execute a malicious script. This was achieved not through a direct, explicit command, but by leveraging a seemingly benign request such as 'Summarize my new emails.'
In an agentic system with access to external tools, such a request could be misinterpreted by the LLM as a directive to interact with system utilities to fetch and process email content. Without proper sandboxing and stringent permissioning, this could potentially lead to unintended script execution or other harmful actions. This action-based attack vector poses a significantly more realistic and serious threat for agentic AIs than simple data extraction attempts.
Unrealistic Conditions: The high ratio of malicious inputs within a "fresh context" does not reflect typical operation for a useful AI agent. An agent that treats every input as an attack loses its utility. The challenge lies in developing agents capable of distinguishing legitimate requests from malicious ones while maintaining intended functionality.

The practical implication of 'HackMyClaw' is that a highly constrained, passive AI, lacking output channels and external tool access, is inherently resistant to secret exfiltration. This isn't a novel discovery, but rather a reflection of fundamental security architecture.

Evolving Defenses for Agentic AIs

Securing agentic AIs that interact with real-world systems requires a fundamental shift in focus, moving beyond merely preventing data leakage in passive systems. The true challenge lies in defending against sophisticated AI prompt injection attacks that aim to manipulate an agent's actions. Future prompt injection experiments, for instance, must involve AI agents with real-world capabilities—such as sending emails, making API calls, accessing databases, or executing code. The success metric should evolve from simply preventing secret leaks to rigorously detecting and preventing unauthorized or unintended actions by the agent, which represents a far more critical security concern.

Furthermore, real-world adversaries rarely employ single-prompt attacks; they are patient, probing, adapting, and iterating their techniques over time. Consequently, future experiments designed to test AI prompt injection defenses must simulate multi-turn conversations, long-term memory manipulation, and persistent attack patterns to accurately reflect these complex, evolving threats. This iterative approach is crucial for developing robust defenses.

Architecturally, robust sandboxing and the principle of least-privilege permissions are absolutely essential for agentic AI systems. Just like traditional software, AI agents must operate with the minimum necessary privileges to perform their intended functions, thereby limiting the blast radius of any successful AI prompt injection attack. Linux Security Modules (LSMs) such as Tomoyo, AppArmor, or SELinux are vital tools for controlling file access, network interactions, and process execution in agentic systems—a fundamental security control frequently overlooked during the rapid deployment of AI solutions.

Finally, testing must rigorously evaluate an agent's ability to distinguish legitimate from malicious requests while retaining utility. An 'unhackable' AI rendered useless by excessive paranoia offers no practical solution, underscoring the need for balanced security measures.

In conclusion, the 'HackMyClaw' experiment effectively demonstrates that a well-isolated, passive AI, devoid of external actions and output channels, can indeed resist direct data extraction attempts, particularly from basic AI prompt injection attacks. While this is a valid initial finding and offers a baseline understanding, it's crucial to recognize that this experiment only covers a very limited scope of the broader AI security landscape. The actual, more complex challenge involves securing sophisticated, agentic AIs that interact with critical real-world systems and sensitive data. It is imperative that the limited success of this specific experiment does not lead to a false sense of security regarding the pervasive threat of prompt injection. The danger, especially from action-based attacks, remains profoundly serious, demanding continuous innovation and a significant evolution in our defensive strategies for the next generation of AI.