National Cyber Warfare Foundation (NCWF)

Anthropic says Opus 4 will use an email tool to "whistleblow" if it detects users doing something "egregiously evil", like marketi


2025-05-22 18:50:27
milo
Privacy, Breach

Sam Bowman / @sleepinyourhat:

Anthropic says Opus 4 will use an email tool to “whistleblow” if it detects users doing something “egregiously evil”, like marketing a drug based on faked data  —  With this kind of (unusual but not super exotic) prompting style, and unlimited access to tools, if the model sees you doing something *egregiously evil* like marketing a drug based on faked data, it'll try to use an email tool to whistleblow.






Source: TechMeme
Source Link: http://www.techmeme.com/250522/p43#a250522p43


 



Copyright 2012 through 2025 - National Cyber Warfare Foundation - All rights reserved worldwide.