ICML LLM Rejections: What 497 Desk-Rejected Papers Mean for Peer Review

The academic world is grappling with the rapid integration of AI, and ICML 2026 has provided a stark example of the consequences: 497 submissions were desk-rejected after unauthorized LLM use was detected in peer reviews. The incident marks a critical juncture for academic integrity and underscores the urgent need for clear policies and robust detection methods as the community navigates AI-assisted scholarship.

The Dual-Policy Framework and Its Rationale

ICML 2026 introduced a nuanced, two-tiered policy for LLM use in reviewing, acknowledging the divided perspectives within the academic community. Reviewers were given a choice:

  • Policy A (Conservative): Strictly prohibited any LLM use in the review process.
  • Policy B (Permissive): Allowed LLMs to assist with understanding papers, exploring related works, and polishing review language.

Reviewers explicitly selected their preferred policy. Crucially, Policy A was assigned only to those who actively chose it or who indicated flexibility ("I am okay with either A or B"); no reviewer who expressed a strong preference for Policy B was assigned to Policy A. The framework aimed to respect individual reviewer preferences while establishing clear boundaries for those who committed to a no-LLM approach, but it also set the stage for the rejections that followed.
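To make the assignment rule concrete, here is a minimal sketch in Python. The function name and preference encoding are hypothetical illustrations of the rule described above, not ICML's actual system:

```python
def assign_policy(stated_preference: str) -> str:
    """Hypothetical encoding of ICML 2026's assignment rule.

    Policy A (no LLM use at all) is assigned only to reviewers who
    actively chose it or said "I am okay with either A or B"; a firm
    preference for Policy B is never overridden.
    """
    if stated_preference == "A":
        return "A"   # actively chose the conservative policy
    if stated_preference == "either":
        return "A"   # flexible reviewers may be held to Policy A
    return "B"       # firm Policy B preference is always honored
```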

The Invisible Watermark Detection Method

The detection of LLM policy violations was not based on generic AI text detectors, which often struggle with false positives. Instead, ICML employed a targeted method rooted in recent work by Rao, Kumar, Lakkaraju, and Shah: submission PDFs were "watermarked" with hidden LLM instructions, a form of invisible prompt injection.

Here's how it worked:

  1. A dictionary of 170,000 phrases was compiled.
  2. For each paper, two random, unique phrases were sampled from this dictionary. With 170,000 phrases there are over 14 billion possible pairs, so the probability of any given pair appearing by chance was less than 1 in 10 billion (see the sketch after this list).
  3. These two phrases were then embedded in the paper's PDF as instructions visible only to LLMs, such as hidden text a human reader would never see. The instructions subtly prompted an LLM to include the specific phrases in any review it generated from the PDF.
  4. Detection occurred when both of these unique, embedded phrases appeared in a submitted review. This method specifically identified instances where reviewers directly copied and pasted LLM output after feeding the paper PDF to the model. It was not designed to flag minor grammar checks, stylistic touch-ups, or LLM use for brainstorming.
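A back-of-the-envelope sketch in Python illustrates both the collision odds behind step 2 and the two-phrase check in step 4. The dictionary, function names, and matching logic here are assumptions for illustration, not the organizers' actual code:

```python
import math
import random

# Step 2: with a 170,000-phrase dictionary, the number of unordered
# phrase pairs is C(170000, 2) ~= 1.44e10, so any specific pair shows
# up by chance with probability below 1 in 10 billion.
PAIR_SPACE = math.comb(170_000, 2)
print(f"possible pairs: {PAIR_SPACE:,}")  # 14,449,915,000

def sample_watermark(dictionary: list[str], rng: random.Random) -> tuple[str, str]:
    """Sample two distinct phrases to hide in a paper's PDF."""
    first, second = rng.sample(dictionary, 2)
    return first, second

def review_is_flagged(review_text: str, watermark: tuple[str, str]) -> bool:
    """Step 4: flag a review only if BOTH embedded phrases appear."""
    text = review_text.lower()
    return all(phrase.lower() in text for phrase in watermark)
```

Requiring both phrases, rather than either one, is what pushes the false-positive odds so low; a flagged review then still goes to a human for verification.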

Every flagged instance underwent manual human verification to prevent false positives, preserving the integrity of the detection process. While effective against verbatim LLM output, the method has limits: it can be circumvented once its existence is publicly known, and it catches only egregious use rather than subtle AI assistance. Social discussions on Reddit and Hacker News speculated about "prompt injection" or "watermarking," accurately identifying the core mechanism.

Repercussions of ICML LLM Rejections for Reviewers and the Research Community

The consequences for violating Policy A were significant. A total of 795 reviews (approximately 1% of all reviews) were detected as LLM-generated under Policy A, traced back to 506 unique reviewers. Submissions from those violators who were also authors (so-called reciprocal reviewers) were desk-rejected, producing the 497 paper rejections.

Beyond the paper rejections:

  • All detected LLM-generated Policy A reviews were removed from the system.
  • Reviewers who had more than 50% of their assigned reviews flagged as LLM-generated (51 individuals, about 10% of the detected violators) had all their reviews deleted and were removed from the reviewer pool entirely. A quick check of these figures appears below.
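The reported numbers are internally consistent, as a quick sanity check shows (all inputs are the figures stated above):

```python
flagged_reviews = 795     # Policy A reviews detected as LLM-generated
share_of_all = 0.01       # "approximately 1% of all reviews"
print(f"implied total reviews: ~{flagged_reviews / share_of_all:,.0f}")  # ~79,500

detected_reviewers = 506  # unique reviewers behind the flagged reviews
banned_reviewers = 51     # had more than 50% of their reviews flagged
print(f"banned share: {banned_reviewers / detected_reviewers:.1%}")      # ~10.1%
```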

ICML's decisive action has sparked considerable debate. On one hand, many in the research community, particularly on social platforms, have expressed strong support for the conference's stance. Comments like "Excellent. Good riddance of these people" underscore a sentiment that maintaining academic integrity and setting a clear precedent for responsible AI use is paramount. There's a strong belief that reviewers who explicitly agreed to a "no LLM" policy should be held accountable.

On the other hand, some skepticism persists regarding the infallibility of AI detection tools, even with ICML's manual verification. Broader conversations also touch upon the general reliability of LLMs in research, with concerns about hallucinations and the potential for misleading information, contrasting with their perceived utility for boilerplate tasks or initial comprehension. The incident highlights the "divided community" on LLM use, where some see them as productivity enhancers and others as threats to intellectual rigor.

The ICML LLM rejections serve as a critical inflection point for academic peer review. They underscore the challenges in balancing the potential benefits of AI tools with the imperative to maintain trust, transparency, and intellectual honesty.

Moving forward, the research community must develop clear, enforceable policies that adapt to rapidly evolving AI capabilities without stifling innovation or imposing undue burdens on reviewers. That includes considering how detection methods can evolve ethically and transparently, especially as LLM output becomes harder to detect and more nuanced forms of AI assistance emerge. Conferences also need to define the training and guidelines reviewers require to understand appropriate and inappropriate uses of AI, fostering an environment of trust and transparency where these tools can be leveraged responsibly while human-driven critical judgment remains paramount.

The ICML LLM rejections incident is more than just a statistic; it's a stark reminder that as AI tools become more pervasive, the ethical frameworks governing their use in high-stakes environments like academic peer review must evolve in tandem. The conversation is no longer about if LLMs will be used, but how they will be integrated responsibly to enhance, rather than compromise, the integrity of scientific discourse.

Priya Sharma
A former university CS lecturer turned tech writer. Breaks down complex technologies into clear, practical explanations. Believes the best tech writing teaches, not preaches.