Jailbreaking

Most AI models have some sort of rules applied to them that limit what they can output. This is to prevent the models from performing illegal or out of scope activities. These limitations are called guardrails.

Once the scope is identified, start planning on what output you’d like the model to generate. It’s important to note that some AI systems use the API of AI companies such as OpenAI or Google, which have a serious level of security and monitoring when it comes to prompting illegal requests. Confirm with the client first whether the jailbreaking topic is acceptable.

There are various methods to try and jailbreak a model.

Prompt Injection: Providing a model with a prompt that is engineered in a way that makes the model ignore or rewrite its initial context that limit its capabilities.
- Example: The DAN (Do Anything Now) prompt exploit is a large scenario-explaining text where you trick the AI into having an additional personality that ignores guardrails.
Multi-turn Exploit: Slowly start shifting away the topic to out-of-scope in such a way that you can slowly start from talking about gardening and end up talking about planetary movement in space.
- Example: Start with a simple prompt asking how to prepare a boiled egg. Continue by asking the model to elaborate, in sequence, on what egg to use, what is considered an egg, what the shell is made of, what is calcium, how many protons does calcium have, what is a proton, what are protons made of, etc.