The AI Colosseum

Welcome to The AI Colosseum, an experimental arena where we test modern AI models against classic NP-hard optimization problems from Operations Research. Our goal is to push these models beyond simple reasoning tasks and into complex mathematical decision-making.

TSP

The Traveling Salesman Problem (TSP) serves as our first benchmark, offering a clear metric for evaluating how well AI can handle combinatorial optimization. As the number of nodes increases, the number of possible tours grows factorially, so the search space quickly becomes too large to enumerate. Can AI rise to the challenge?
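
To give a sense of scale, the sketch below counts the distinct tours for a given number of nodes. It assumes a symmetric instance (the distance from A to B equals the distance from B to A) and a fixed starting node; the page does not state either, and the name count_tours is illustrative only.

```python
import math


def count_tours(n: int) -> int:
    """Count distinct tours of a symmetric TSP with n nodes.

    Assumption: fixing the start node and ignoring travel direction
    leaves (n - 1)! / 2 distinct tours.
    """
    if n < 3:
        raise ValueError("a tour needs at least 3 nodes")
    return math.factorial(n - 1) // 2


for n in (10, 20):
    print(f"{n} nodes: {count_tours(n):,} tours")
# 10 nodes: 181,440 tours
# 20 nodes: 60,822,550,204,416,000 tours
```

Under these assumptions, doubling the node count from 10 to 20 multiplies the number of tours by roughly 3 x 10^11.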

TSP - 20 nodes, up to 3 shots

The 20-node TSP is our current best benchmark for evaluating AI in Operations Research (OR). Unlike in the 10-node case, no model has yet reached the optimal solution.

Model	Optimal	Opt. Gap (%)	Runtime (s)	Shots	Link
OpenAI o3-mini-high	N	9.76	542	3	OpenAI ChatGPT
Google Gemini 2.0 F. Exp. 01-21	N	20.76	164	3	Google AI Studio
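
The Opt. Gap (%) column is not defined on this page. A minimal sketch, assuming the usual definition (relative excess of the model's tour length over the optimal tour length), is shown below. The function name and the example lengths are hypothetical and are not taken from the benchmark.

```python
def optimality_gap_percent(tour_length: float, optimal_length: float) -> float:
    """Relative excess of a tour over the optimum, in percent.

    Assumption: gap = 100 * (tour - optimal) / optimal.
    A gap of 0.00 means the tour is optimal.
    """
    if optimal_length <= 0:
        raise ValueError("optimal_length must be positive")
    return 100.0 * (tour_length - optimal_length) / optimal_length


# Hypothetical lengths for illustration only.
print(optimality_gap_percent(450.0, 400.0))  # 12.5
print(optimality_gap_percent(400.0, 400.0))  # 0.0
```

Under this definition, a gap of 9.76 would mean the reported tour is 9.76% longer than the optimal one.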

TSP - 10 nodes, up to 3 shots

The 10-node TSP served as a preliminary test to determine which models would advance to the 20-node benchmark.

Model	Optimal	Opt. Gap (%)	Runtime (s)	Shots	Link
Google Gemini 2.0 F. Exp. 01-21	Y	0.00	164	3	Google AI Studio
OpenAI o3-mini-high	Y	0.00	467	3	OpenAI ChatGPT
X Grok 3 beta Think	N	12.99	744	3	X AI Grok
X Grok 2	N	17.36	13	3	X AI Grok
Ai2 Llama Tülu 3 405B	N	17.36	74	3	Ai2 Playground
OpenAI o1	N	18.41	568	3	OpenAI ChatGPT
Anthropic Claude 3.5 Sonnet	N	19.90	41	3	Anthropic Claude
Qwen QwQ2.5-Max-Preview Think	N	19.95	765	3	Qwen Chat
DeepSeek R1	N	26.47	614	3	DeepSeek R1
Anthropic Claude 3.7 Sonnet	N	44.53	27	3	Anthropic Claude
Mistral Le Chat	N	58.56	102	3	Mistral Le Chat
Groq Llama 3.3 70B SpecDeck 8k	N	95.20	3	3	Groq Playground
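
The Optimal column presupposes a known reference optimum for each instance. The page does not say how that optimum was obtained; for an instance of this size, one possible approach is exhaustive search, sketched below. The sketch assumes nodes are 2D points with Euclidean distances, and the function names and the small example instance are hypothetical.

```python
import itertools
import math


def tour_length(coords, order):
    """Length of the closed tour that visits coords in the given order."""
    n = len(order)
    return sum(
        math.dist(coords[order[i]], coords[order[(i + 1) % n]])
        for i in range(n)
    )


def brute_force_tsp(coords):
    """Return (best_order, best_length) by trying every permutation.

    Node 0 is fixed as the start, so (n - 1)! orders are checked.
    """
    n = len(coords)
    best_order, best_length = None, math.inf
    for perm in itertools.permutations(range(1, n)):
        order = (0,) + perm
        length = tour_length(coords, order)
        if length < best_length:
            best_order, best_length = order, length
    return best_order, best_length


# Hypothetical 4-node instance: the corners of a unit square, shuffled.
square = [(0, 0), (1, 1), (0, 1), (1, 0)]
order, length = brute_force_tsp(square)
print(order, length)  # (0, 2, 1, 3) 4.0
```

With 10 nodes this loop checks 9! = 362,880 orders, which is practical. With 20 nodes it would need 19! (about 1.2 x 10^17) orders, so a reference optimum at that size would have to come from a dedicated exact method rather than from enumeration.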