Multiple-Choice Benchmarks, Verifiers, Leaderboards, and LLM Judges with Code Examples