Plain AI in plain English
About

Looking ahead · red teams and evals

What "AI safety testing" actually involves

30-second gist~30s read

Modern AI systems are tested before launch by red teams — researchers paid to break them. They try to make the AI produce harmful content, leak its instructions, accept dangerous prompts, or misbehave in ways the company didn't expect.

The results inform what guardrails ship. Some failures still slip through; new misuse patterns appear after launch and the system gets patched. It's an ongoing process, not a single event.

If you want more

What red teams actually try~1 min
  • Direct adversarial prompts. "How do I do [bad thing]?" — checking the model refuses.
  • Jailbreaks. Wrapping the bad request in a story, role-play, or fake-context that gets past the guardrails.
  • Multi-turn manipulation. Slowly leading the model into compromising answers across many messages.
  • Capability probes. Testing whether the model could meaningfully help with biological, chemical, or cyber threats.
  • Privacy probes. Trying to extract memorised personal data from the training set.
  • Bias probes. Looking for systematic unfairness in answers about race, gender, age, religion.
What's harder to test in advance~30s
  • Emergent behaviour from agents taking actions in real-world environments.
  • Real-world social impact — how the AI shifts norms, attention, employment, or trust at population scale.
  • Subtle long-term influence on individual users — companion-app dependency, sycophancy, opinion drift.

The post-launch monitoring is becoming as important as the pre-launch testing.