[Speakers]
Adversary Village at
DEF CON 33

Ethan Michalak

cybersecurity engineer | MITRE | Caldera contributor

Ethan Michalak is a cybersecurity engineer and an avid CTF player. He focuses on adversary emulation, detection engineering, and malware development. In his free time, Ethan plays video games, reads, or makes cocktails.

Hands-on workshop: MITRE Caldera: Purple Teaming in the Future

Saturday | Aug 9th 2025
Adversary Village workshop stage | Las Vegas Convention Center

Purple Team
Adversary Emulation

The rapid advancement of large language models (LLMs) is reshaping the cybersecurity landscape. These models are not only posting higher scores on math, coding, and cybersecurity benchmarks but are also being leveraged by threat actors to enhance resource development and social engineering capabilities. As LLMs continue to evolve, what could autonomous cyber capabilities powered by these models look like? How can we responsibly harness their potential for adversary emulation and defense?
In this talk, we will explore the integration of LLMs into MITRE Caldera, a scalable automated adversary emulation platform, and investigate how these models can transform adversary emulation through three distinct paradigms: as planners, as factories for constructing custom cyber abilities, and as forward-deployed autonomous agents. Drawing on existing research, including papers on LLM-assisted malware development and benchmarks for offensive cyber operations, we will examine the capabilities of LLMs in generating plausible emulations of advanced persistent threats (APTs).
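The first of the three paradigms above, the LLM as planner, can be sketched as a simple decision loop: the model repeatedly chooses the next ability to run from a library until the operation's goal is met. Everything below is hypothetical, a minimal illustration of the control flow rather than Caldera's actual API; the ability names and the stubbed-out model call are invented for the example.

```python
# Minimal sketch of the "LLM as planner" paradigm. The planner picks the
# next ability from a fixed library until nothing remains to do. The
# ability names and stub planner are illustrative only, not Caldera's API.

ABILITY_LIBRARY = {
    "discover-hosts": "Enumerate reachable hosts on the local subnet",
    "collect-files": "Stage files of interest from user directories",
    "exfil-archive": "Compress and exfiltrate the staged files",
}

def stub_llm_planner(goal, history):
    """Stand-in for an LLM call: return the next ability id, or None when done."""
    remaining = [a for a in ABILITY_LIBRARY if a not in history]
    return remaining[0] if remaining else None

def run_operation(goal):
    """Drive an operation by repeatedly asking the planner for the next step."""
    history = []
    while (ability := stub_llm_planner(goal, history)) is not None:
        # A real planner would dispatch the ability to an agent,
        # observe its output, and feed that back into the next prompt.
        history.append(ability)
    return history

plan = run_operation("emulate a data-theft adversary")
```

In a real integration the stub would be replaced by a model call whose prompt includes the goal, the ability library, and the output of previous steps, which is what distinguishes an LLM planner from a fixed ordering.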

The session will feature live demonstrations showcasing how LLMs can replicate adversary profiles, construct new cyber abilities on the fly, and autonomously execute emulation tasks. Attendees will gain insights into the performance of these paradigms, their implications for purple teaming, and the challenges of maintaining realistic emulations.
Finally, we will look ahead to the future of adversary emulation, discussing how APTs might leverage autonomous or semi-autonomous LLM capabilities in practice and the role of increasingly powerful models in shaping the next generation of cybersecurity tools. Whether you're a defender, researcher, or technologist, this talk will provide a compelling glimpse into the possibilities and risks of LLM-enabled adversary emulation.

Detailed workshop outline:

  • 1. Introduction | 5 min total
    • a. Personal Introductions | 1 min
    • b. Large language models are getting better and better | 3 min
      • i. LLMs post higher scores on math, coding, and cybersecurity benchmarks over time.
      • ii. Evidence suggests threat actors are already using large language models to enhance their capabilities, mostly in resource development and social engineering.
      • iii. Where do we go from here? (What would LLM-enabled autonomous cyber capabilities look like?)
    • c. Presentation thesis statement | 1 min
      • i. We can explore several iterations of autonomous capability with large language models: by adjusting control of the Caldera planner, by handing control of a Caldera agent to an LLM to create an LLM agent, or by constructing new cyber abilities on the fly.
  • 2. Existing Research | 5 min total
  • 3. What is Caldera and how does it currently function? | 5 min total
    • a. Caldera is a scalable, automated adversary emulation platform | 2 mins
      • i. autonomous adversary emulation / press "go" style
      • ii. testing of EDR/XDR / ATT&CK Evals
      • iii. purple team style of testing
    • b. Caldera current capabilities and functions | 3 mins
      • i. agents, adversary profiles, abilities linked to ATT&CK
      • ii. example adversary profile "Thief" in detail
  • 4. Large Language Models in Caldera | 10 min total
    • a. Different Paradigms for autonomous functionality | 2 mins
      • i. Large language models as a planner (instructing and directing)
      • ii. Large language models as a factory (constructing custom abilities, i.e. commands and cyber capabilities, on the fly)
      • iii. Forward-deployed large language model agent
        • iiia. Instructs and directs itself
        • iiib. Guided by an initial deployed goal, loosely coupled with C2 (Caldera); reaches out once it has accomplished its goal
        • iiic. Creates its own abilities
    • b. Demo | 8 mins
      • i. Planner demo, instructed to replicate original thief adversary profile
      • ii. Factory demo, instructed to accomplish goal outside scope of abilities already present
      • iii. Agent demo, instructed to replicate specified APT profile
    • c. Results | 2 mins
      • i. How did the different paradigms perform? (Was the goal accomplished while maintaining a plausible emulation?)
        • ia. Planner experiment analysis results
        • ib. Factory experiment analysis results
        • ic. Agent experiment analysis results
  • 5. Looking forward to large language models in adversary emulation | 5 min total
    • a. What will APTs seek to use in practice? | 3 mins
      • i. Autonomous vs. semi-autonomous use cases
    • b. Models will likely become even better in the future | 2 mins
      • i. Increase in model performance will likely influence autonomous purple teaming in the future
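The factory paradigm from section 4 can likewise be sketched in miniature: when no existing ability fits the goal, the model emits a new command, which is then wrapped in an ability record with ATT&CK metadata. The record's shape loosely mirrors Caldera's YAML ability files, but treat the field names as illustrative assumptions rather than the real schema, and the stubbed model call returns a canned command a real LLM would otherwise generate.

```python
# Sketch of the "LLM as factory" paradigm: the model writes a new command
# for a goal outside the existing ability set, and we package it as an
# ability record. Field names only loosely follow Caldera's ability format.
import uuid

def stub_llm_factory(goal):
    """Stand-in for an LLM call that writes a shell command for the goal."""
    # Canned, hypothetical response; a real model would generate this.
    return "find /home -maxdepth 3 -name '*.ssh'"

def build_ability(goal, tactic, technique_id):
    """Wrap a generated command in an ability record an operation could run."""
    return {
        "id": str(uuid.uuid4()),
        "name": f"generated: {goal}",
        "tactic": tactic,
        "technique": {"attack_id": technique_id},
        "platforms": {"linux": {"sh": {"command": stub_llm_factory(goal)}}},
    }

ability = build_ability("locate SSH keys", "discovery", "T1083")
```

The interesting challenge this sketch hides, and the one the results section of the talk addresses, is validation: a generated command must actually run on the target platform and stay plausible for the adversary profile being emulated.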

