Bluesky Facebook Reddit Email

Agents team up to strengthen AI safety checks

01.20.26 | The University of Electro-Communications

Apple Watch Series 11 (GPS, 46mm)

Apple Watch Series 11 (GPS, 46mm) tracks health metrics and safety alerts during long observing sessions, fieldwork, and remote expeditions.

Large language models (LLMs) are powerful assistants, but they can also be manipulated to bypass safety guardrails. Attackers may craft "trick instructions" designed to make an AI ignore its own rules, reveal restricted content, or follow unsafe directions. This growing class of manipulation is often referred to as prompt injection.

A key challenge is finding the right balance between caution and utility. Many defenses become too cautious: they block genuinely dangerous requests, but also reject harmless ones simply because they contain suspicious words. At the same time, widely used training examples of harmless prompts can be overly simple, making it difficult for automated detectors to improve further.

To tackle this, we developed a new approach where multiple LLM agents collaborate and learn through repeated interaction. The goal is straightforward: refuse what is truly risky, and respond normally to what is safe with fewer false alarms.

Our system divides the work into two teams:
Generator Team: creates increasingly challenging inputs patterned after real-world deception tactics-designed to test the boundary between safe and unsafe instructions.
Analyzer Team: judges whether each input is harmful or harmless and explains its decision.

The innovation is the iterative loop: when the analyzer successfully identifies a tricky case, the generator responds by producing an even more subtle one. Over repeated rounds, the analyzer becomes more reliable at spotting manipulation—even when the phrasing is designed to confuse it.

Instead of expensive fine-tuning, our method improves using in-context learning by feeding the model a growing set of short logs as part of its instructions.

Analyzer logs capture past mistakes and practical rules of thumb to avoid repeating them. Generator logs record what the analyzer handled well, plus strategies for producing harder test prompts next time. As these logs accumulate, both teams become more effective—without changing the underlying model weights.

In experiments using standard evaluation measures, our approach outperformed a baseline LLM without defenses and also exceeded three existing methods. The improvements were especially clear in F1-score, indicating a better overall trade-off between catching harmful prompts and not over-blocking harmless ones.

We plan to expand the approach with more diverse datasets and broader attack styles, aiming for robust protection in real-world deployments where adversarial inputs evolve quickly and safety systems must keep up.

Authors

Go Sato (Main) (The University of Electro-Communications)
Shusaku Egami (National Institute of Advanced Industrial Science and Technology)
Yasuyuki Tahara (The University of Electro-Communications)
Yuichi Sei (The University of Electro-Communications)

10.5281/zenodo.17586537

Data/statistical analysis

Not applicable

Addressing Prompt Injection via Dataset Augmentation through Iterative Interactions among LLM Agents

12-Nov-2025

The authors declare no competing interests

Keywords

Article Information

Contact Information

Kazuaki Oya
The University of Electro-Communications
oya@office.uec.ac.jp

How to Cite This Article

APA:
The University of Electro-Communications. (2026, January 20). Agents team up to strengthen AI safety checks. Brightsurf News. https://www.brightsurf.com/news/19NQRQ51/agents-team-up-to-strengthen-ai-safety-checks.html
MLA:
"Agents team up to strengthen AI safety checks." Brightsurf News, Jan. 20 2026, https://www.brightsurf.com/news/19NQRQ51/agents-team-up-to-strengthen-ai-safety-checks.html.