Welcome to The AI Colosseum, an experimental arena where we test modern AI models against classic NP-hard optimization problems from Operations Research (OR). Our goal is to push these models beyond simple reasoning tasks and into complex mathematical decision-making.
The Traveling Salesman Problem (TSP) serves as our first benchmark, offering a clear metric for evaluating how well AI can handle combinatorial optimization: the gap between a model's tour and the optimal tour. As problem size increases, the number of possible tours grows factorially. Can AI rise to the challenge?
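To give a sense of how fast the search space grows, here is a minimal sketch that counts distinct tours. It assumes a symmetric TSP (the distance from A to B equals the distance from B to A), which this page does not state explicitly; under that assumption an n-node instance has (n - 1)! / 2 distinct tours. The function name `count_distinct_tours` is made up for illustration.

```python
from math import factorial


def count_distinct_tours(n: int) -> int:
    """Number of distinct closed tours in a symmetric TSP with n nodes.

    Fixing the start node removes rotations, and dividing by 2 removes
    the mirror image of each tour, giving (n - 1)! / 2.
    """
    if n < 3:
        raise ValueError("a tour needs at least 3 nodes")
    return factorial(n - 1) // 2


# Compare the two instance sizes used on this page.
for n in (10, 20):
    print(f"{n} nodes: {count_distinct_tours(n):,} distinct tours")
```

Going from 10 to 20 nodes moves the count from roughly 1.8 × 10^5 to roughly 6 × 10^16 tours, which is why exhaustive checking stops being practical well before 20 nodes.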
The 20-node TSP is our current best benchmark for evaluating AI in OR. Unlike the 10-node case, no model has yet reached the optimal solution.
| Model | Optimal | Optimality Gap (%) | Runtime (s) | Shots | Link |
|---|---|---|---|---|---|
| OpenAI o3-mini-high | N | 9.76 | 542 | 3 | OpenAI ChatGPT |
| Google Gemini 2.0 F. Exp. 01-21 | N | 20.76 | 164 | 3 | Google AI Studio |
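The page does not spell out how the gap column is computed. A common definition, assumed here, is the percentage by which a model's tour length exceeds the optimal length. The sketch below uses Euclidean distances and a made-up 4-node instance; `tour_length` and `optimality_gap` are illustrative names, not part of the benchmark.

```python
import math


def tour_length(coords, tour):
    """Length of the closed tour that visits coords in the given order."""
    # A valid tour visits every node exactly once.
    if sorted(tour) != list(range(len(coords))):
        raise ValueError("tour must visit every node exactly once")
    n = len(tour)
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
        for i in range(n)
    )


def optimality_gap(found, optimal):
    """Percentage by which the found length exceeds the optimal length."""
    return 100.0 * (found - optimal) / optimal


# Hypothetical instance: the four corners of a unit square.
coords = [(0, 0), (1, 0), (1, 1), (0, 1)]
optimal = tour_length(coords, [0, 1, 2, 3])   # walks the perimeter
crossing = tour_length(coords, [0, 2, 1, 3])  # uses both diagonals
print(f"gap: {optimality_gap(crossing, optimal):.2f}%")
```

Under this definition a gap of 0.00 corresponds to a "Y" in the Optimal column, and the validity check matters because a tour that skips or repeats a node can look deceptively short.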
The 10-node TSP served as a preliminary test to determine which models would advance to the 20-node benchmark.
| Model | Optimal | Optimality Gap (%) | Runtime (s) | Shots | Link |
|---|---|---|---|---|---|
| Google Gemini 2.0 F. Exp. 01-21 | Y | 0.00 | 164 | 3 | Google AI Studio |
| OpenAI o3-mini-high | Y | 0.00 | 467 | 3 | OpenAI ChatGPT |
| X Grok 3 beta Think | N | 12.99 | 744 | 3 | X AI Grok |
| X Grok 2 | N | 17.36 | 13 | 3 | X AI Grok |
| Ai2 Llama Tülu 3 405B | N | 17.36 | 74 | 3 | Ai2 Playground |
| OpenAI o1 | N | 18.41 | 568 | 3 | OpenAI ChatGPT |
| Anthropic Claude 3.5 Sonnet | N | 19.90 | 41 | 3 | Anthropic Claude |
| Qwen QwQ2.5-Max-Preview Think | N | 19.95 | 765 | 3 | Qwen Chat |
| DeepSeek R1 | N | 26.47 | 614 | 3 | DeepSeek R1 |
| Anthropic Claude 3.7 Sonnet | N | 44.53 | 27 | 3 | Anthropic Claude |
| Mistral Le Chat | N | 58.56 | 102 | 3 | Mistral Le Chat |
| Groq Llama 3.3 70B SpecDec 8k | N | 95.20 | 3 | 3 | Groq Playground |
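The page does not say how the optimal reference tour was obtained. For an instance as small as 10 nodes, one option is plain exhaustive search, sketched below under the assumption of Euclidean distances. The function `brute_force_tsp` and the 5-node coordinates are hypothetical and chosen only to keep the example quick to run.

```python
import math
from itertools import permutations


def brute_force_tsp(coords):
    """Return (best_length, best_tour) by checking every tour from node 0."""
    n = len(coords)
    # Precompute pairwise Euclidean distances once.
    dist = [[math.dist(a, b) for b in coords] for a in coords]
    best_length, best_tour = math.inf, None
    # Fixing node 0 as the start avoids re-checking rotations of a tour.
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if length < best_length:
            best_length, best_tour = length, tour
    return best_length, best_tour


# Hypothetical 5-node instance with all points in convex position.
coords = [(0, 0), (2, 0), (2, 1), (1, 2), (0, 1)]
best_length, best_tour = brute_force_tsp(coords)
print(f"optimal length: {best_length:.4f}, tour: {best_tour}")
```

With 10 nodes this loop checks 9! = 362,880 orderings, which is feasible in pure Python. The same approach is not practical at 20 nodes, where an exact solver or dynamic programming would be needed instead.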