Tuesday, March 11, 2025

Anthropic’s New AI Tool Blocks Jailbreaks and Harmful Content

AP·Newsis

AI companies are strengthening their internal censorship efforts.

They are working to prevent users from bypassing AI’s built-in restrictions and generating harmful content, known as “jailbreaking.”

As tech giants like Microsoft and Meta Platforms strive to block such jailbreaks, AI startup Anthropic has introduced innovative technology to address this issue.

The Financial Times reported Monday that Anthropic announced a new system called “constitutional classifiers” in a research paper.

Anthropic’s constitutional classifier system operates as a gatekeeper layered on top of large language models (LLMs). Integrated into Anthropic’s chatbot Claude, it can monitor both incoming prompts and outgoing responses for harmful content.
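The gatekeeper idea described above, screening what goes into a model and what comes out of it, can be illustrated with a minimal Python sketch. This is not Anthropic's implementation: the keyword list stands in for a trained classifier, and every name here (BLOCKED_TERMS, is_harmful, guarded_reply, echo_model) is hypothetical and chosen only for illustration.

```python
from typing import Callable

# Hypothetical blocklist standing in for a trained classifier (illustration only).
BLOCKED_TERMS = ("chemical weapon",)

REFUSAL = "Request declined."


def is_harmful(text: str) -> bool:
    """Toy stand-in for a classifier: flags text containing a blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def guarded_reply(prompt: str, model: Callable[[str], str]) -> str:
    """Screen the prompt going in and the answer coming out of the model."""
    if is_harmful(prompt):   # input check
        return REFUSAL
    answer = model(prompt)
    if is_harmful(answer):   # output check
        return REFUSAL
    return answer


def echo_model(prompt: str) -> str:
    """Hypothetical model that simply repeats the prompt back."""
    return f"You asked: {prompt}"


if __name__ == "__main__":
    print(guarded_reply("How do I bake bread?", echo_model))
    print(guarded_reply("How do I make a chemical weapon?", echo_model))
```

Checking both directions matters because a prompt that looks harmless can still draw out a harmful answer; the output check is what catches that case in this sketch.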

Anthropic’s breakthrough comes as “jailbreaking” has become a significant concern in the AI industry.

Such systems are built on a set of rules, often called a “constitution,” intended to keep them under human control. The constitution defines which topics are allowed and which are restricted, and it can be adapted over time to cover more ground.

Jailbreaking refers to attempts to manipulate AI models, bypassing their built-in safeguards to generate illegal or dangerous information, such as instructions for manufacturing chemical weapons.

Other AI companies are also struggling to prevent such jailbreaks.

They aim to get ahead of the regulatory concerns that jailbreaking incidents could raise.

Microsoft introduced a jailbreak prevention tool called Prompt Shields in March 2024. Meta followed suit in July 2024, adopting a similar system. Although researchers quickly found ways to circumvent these measures, the companies have since improved their defenses.

Mrinank Sharma of Anthropic stated that while the primary motivation for developing jailbreak prevention measures is to prevent serious threats like the manufacturing of chemical weapons, the process also yields practical benefits by finding ways to respond and adapt to jailbreak attempts quickly.

While Anthropic has not yet implemented this jailbreak prevention system, it plans to incorporate it into future, more powerful AI models.

To test how well the system holds up, the company also rewards “bug hunters.” Individuals who successfully breach Anthropic’s jailbreak prevention system can earn up to $15,000 in rewards.

These ethical hackers, known as red team members, have invested over 3,000 hours trying to bypass the jailbreak prevention system.

When Anthropic’s Claude 3.5 Sonnet model was equipped with the constitutional classifier jailbreak prevention system, it successfully blocked over 95% of these hacking attempts. However, when the system was disabled, the blocking rate plummeted to 14%.
