MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity
MALT (Manually-reviewed Agentic Labeled Transcripts) is a dataset of natural and prompted examples of behaviors that threaten evaluation integrity, such as generalized reward hacking and sandbagging.