by microsoft
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.
# Add to your Claude Code skills
git clone https://github.com/microsoft/WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable Windows AI agent platform for testing and benchmarking multi-modal, desktop AI agents. WAA provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, where agentic AI workflows can be tested across a diverse range of tasks.
WAA supports the deployment of agents at scale using the Azure ML cloud infrastructure, allowing for the parallel running of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes, not days.
To use the harder difficulty, change diff_lvl="normal" to diff_lvl="hard" in src/win-arena-container/start_client.sh. Under the harder difficulty, in many tasks, agents must also learn to initialize/set up the task themselves (e.g., finding and opening the right program/application for the task) rather than have the task "set up" for them by the task config.
./run-local.sh --som-origin mixed-omni --gpu-enabled true
Our technical report can be found here. If you find this environment useful, please consider citing our work:
@article{bonatti2024windows,
author = {Bonatti, Rogerio and Zhao, Dan and Bonacci, Francesco and Dupont, Dillon and Abdali, Sara and Li, Yinheng and Wagle, Justin and Koishida, Kazuhito and Bucker, Arthur and Jang, Lawrence and Hui, Zack},
title = {Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale},
institution = {Microsoft},
year = {2024},
month = {September},
}
conda create -n winarena python=3.9
Clone the repository and install dependencies:
git clone https://github.com/microsoft/WindowsAgentArena.git
cd WindowsAgentArena
# Install the required dependencies in your python environment
# conda activate winarena
pip install -r requirements.txt
Create a new config.json at the root of the project with the necessary keys. Set OPENAI_API_KEY if you are using the OpenAI endpoint, or AZURE_API_KEY and AZURE_ENDPOINT if you are using an Azure endpoint:
{
"OPENAI_API_KEY": "<OPENAI_API_KEY>",
"AZURE_API_KEY": "<AZURE_API_KEY>",
"AZURE_ENDPOINT": "https://yourendpoint.openai.azure.com/"
}
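Standard JSON parsers reject comments and trailing commas, so generating the file programmatically is one way to avoid syntax slips. The sketch below is illustrative only: the helper name write_config is hypothetical, and it assumes the project reads only the keys for the endpoint you actually use.

```python
import json
from pathlib import Path


def write_config(path, openai_key=None, azure_key=None, azure_endpoint=None):
    """Write the supplied keys to `path` as strict JSON and return them.

    Hypothetical helper: the key names come from the setup instructions,
    but omitting unused keys is an assumption about how the project reads
    config.json.
    """
    candidates = {
        "OPENAI_API_KEY": openai_key,
        "AZURE_API_KEY": azure_key,
        "AZURE_ENDPOINT": azure_endpoint,
    }
    # Keep only the keys that were actually provided.
    config = {name: value for name, value in candidates.items() if value}
    if not config:
        raise ValueError("provide an OpenAI key, or an Azure key and endpoint")
    # json.dumps never emits comments or trailing commas.
    Path(path).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, write_config("config.json", openai_key="<OPENAI_API_KEY>") run from the project root produces a file containing only the OpenAI key.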
To get started, pull the base image from Docker Hub:
docker pull windowsarena/winarena-base:latest
This image includes all the necessary dependencies (such as packages and models) required to run the code in the src directory.
Next, build the WinArena image locally:
cd scripts
./build-container-image.sh
# If 'Dockerfile-WinArena-Base' has changed, use the --build-base-image flag to also build the base image locally
# ./build-container-image.sh --build-base-image true
# For other build options:
# ./build-container-image.sh --help
This will create the windowsarena/winarena:latest image with the latest code from the src directory.
setup.iso and copy it to the directory WindowsAgentArena/src/win-arena-container/vm/image
Before running the arena, you need to prepare a new WAA snapshot (also referred to as the WAA golden image). This 30GB snapshot is a fully functional Windows 11 VM with all the programs needed to run the benchmark. The VM also hosts a Python server that receives and executes agent commands. To learn more about the components at play, see our local and cloud components diagrams.
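Before starting the preparation, it can help to confirm that setup.iso sits where the instructions say it should. A minimal sketch, assuming it is run from the directory that contains the WindowsAgentArena clone; the function name check_setup_iso is hypothetical and is not part of the repository.

```python
from pathlib import Path

# Location taken from the setup instructions, relative to the parent of the clone.
ISO_RELATIVE_PATH = Path("WindowsAgentArena/src/win-arena-container/vm/image/setup.iso")


def check_setup_iso(base_dir="."):
    """Return (found, message) for the ISO expected under base_dir.

    Hypothetical pre-flight check; it only tests that the file exists and
    reports its size, it does not validate the ISO contents.
    """
    iso_path = Path(base_dir) / ISO_RELATIVE_PATH
    if not iso_path.is_file():
        return False, f"missing: {iso_path}"
    size_gb = iso_path.stat().st_size / 1024**3
    return True, f"found {iso_path} ({size_gb:.1f} GB)"
```

Calling check_setup_iso() and printing the message gives a quick yes/no answer before the preparation run begins.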
To prepare the golden image, run once:
cd ./scripts
./run-local.sh --prepare-image true
You can monitor progress at http://localhost:8006. The preparation process is fully automated and will take ~20 minutes.
Please do not interfere with the VM while it is being prepared. It will automatically shut down when the provisioning process is complete.
At the end, the Docker container named winarena should terminate gracefully.
You will find the 30GB WAA golden image in WindowsAgentArena/src/win-arena-container/vm/storage.
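To see what the preparation step produced, the storage directory can be listed along with the size of each file. This is a sketch rather than a project tool: summarize_directory is a hypothetical helper, and the exact file names you will see are not assumed here.

```python
from pathlib import Path


def summarize_directory(directory):
    """Return (files, total_bytes) for every regular file under `directory`.

    `files` is a sorted list of (relative_path, size_in_bytes) tuples.
    Hypothetical helper for inspecting the golden image folder.
    """
    root = Path(directory)
    files = sorted(
        (str(path.relative_to(root)), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )
    total_bytes = sum(size for _, size in files)
    return files, total_bytes
```

Pointing it at WindowsAgentArena/src/win-arena-container/vm/storage and dividing the total by 1024**3 gives a figure that can be compared against the roughly 30GB the snapshot is expected to occupy.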
To include changes made to the src/win-arena-container directory in the WAA golden image, pass the flag --skip-build false to the run-local.sh script (it defaults to true). This ensures that a new container image is built instead of using the prebuilt windowsarena/winarena:latest image.
We recommend copying the storage folder to a safe location outside of the repository in case you or the agent accidentally corrupt the VM at some point and you want to avoid a fresh setup.
If you encounter the error /bin/bash: bad interpreter: No such file or directory, we recommend converting the bash scripts from DOS/Windows format to Unix format:
cd ./scripts
find . -maxdepth 1 -type f -exec dos2unix {} +
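If dos2unix is not installed, the same line-ending conversion can be approximated in Python. The sketch below rewrites CRLF as LF for the regular files directly inside a directory, roughly mirroring the find command above; the function names are hypothetical, and unlike dos2unix it does not skip binary files, so it should only be pointed at a folder of text scripts.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows line endings (CRLF) with Unix line endings (LF)."""
    return data.replace(b"\r\n", b"\n")


def convert_scripts(directory):
    """Convert files directly inside `directory`; return the names changed.

    Subdirectories are not visited, similar to `find . -maxdepth 1 -type f`.
    """
    changed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        original = path.read_bytes()
        converted = crlf_to_lf(original)
        if converted != original:
            # Rewriting in place keeps the existing file permissions.
            path.write_bytes(converted)
            changed.append(path.name)
    return changed
```

Running convert_scripts("scripts") from the repository root returns the list of files that were rewritten, which is empty when everything already uses Unix line endings.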
You're now ready to launch the evaluation. To run the baseline agent on all benchmark tasks, do:
cd scripts
./run-local.sh
# For client/agent options:
# ./run-local.sh --help
Open http://localhost:8006 to see the Windows VM with the agent running. If you have a beefy PC, you can instead run the strongest agent configuration in our paper by doing:
./run-local.sh --gpu-enabled true --som-origin mixed-omni --a11y-backend uia
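When launching runs from a Python driver, for instance to compare agent configurations, the command line can be assembled from keyword arguments. This is a sketch under stated assumptions: the helper names are hypothetical, and only the three flags shown above are used, with the values that appear in this README. Run ./run-local.sh --help for the authoritative list of options.

```python
import subprocess


def build_run_command(gpu_enabled=False, som_origin=None, a11y_backend=None):
    """Assemble the argument list for run-local.sh from optional settings.

    Flags are only added when a value is given, so calling with no
    arguments yields the plain baseline command.
    """
    command = ["./run-local.sh"]
    if gpu_enabled:
        command += ["--gpu-enabled", "true"]
    if som_origin is not None:
        command += ["--som-origin", som_origin]
    if a11y_backend is not None:
        command += ["--a11y-backend", a11y_backend]
    return command


def run_benchmark(scripts_dir="scripts", **options):
    """Run the script from scripts_dir; raises CalledProcessError on failure."""
    return subprocess.run(build_run_command(**options), cwd=scripts_dir, check=True)
```

For example, build_run_command(gpu_enabled=True, som_origin="mixed-omni", a11y_backend="uia") reproduces the command above as an argument list, which avoids shell quoting issues when passed to subprocess.run.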
At the end of the run you can display the results using the command:
cd src/win-arena-container/client
python show_results.py