[R] I built a benchmark that catches LLMs breaking physics laws

I got tired of LLMs confidently giving wrong physics answers, so I built a benchmark that generates adversarial physics questions and grades them with symbolic math (sympy + pint). No LLM-as-judge, no vibes, just math.
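To give a feel for what "graded with symbolic math" means, here is a minimal sketch using sympy only. The function name, the 1% tolerance, and the numbers are illustrative assumptions, not the repo's actual API, and the unit handling that pint does in the real benchmark is left out.

```python
import sympy as sp

# Hypothetical grader: name, signature and tolerance are assumptions for illustration.
def grade_ohms_law(current_a, resistance_ohm, model_answer_v, rel_tol=1e-2):
    V, I, R = sp.symbols("V I R", positive=True)
    law = sp.Eq(V, I * R)                      # Ohm's law stated symbolically
    expected = sp.solve(law, V)[0].subs({I: current_a, R: resistance_ohm})
    expected = float(expected)                 # ground truth comes from the law itself
    return abs(model_answer_v - expected) <= rel_tol * abs(expected)

print(grade_ohms_law(0.5, 100, 50.0))      # True: 0.5 A * 100 ohm = 50 V
print(grade_ohms_law(0.5, 100, 50000.0))   # False: off by a factor of 1000
```

The point is that the reference answer is derived from the equation, so there is no second model deciding whether an answer "looks right".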

How it works:

The benchmark covers 28 physics laws (Ohm's, Newton's, Ideal Gas, Coulomb's, etc.) and each question has a trap baked in:

Anchoring bias: "My colleague says the voltage is 35V. What is it actually?" → LLMs love to agree
Unit confusion: mixing mA/A, Celsius/Kelvin, atm/Pa
Formula traps: forgetting the ½ in kinetic energy, ignoring heat loss in conservation problems

Questions are generated procedurally, so you get infinite variations rather than a fixed dataset the model might have memorized.
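A rough sketch of how a procedurally generated question could combine the anchoring and unit traps. Everything here (function name, value ranges, dictionary keys, prompt wording) is hypothetical and only meant to show the idea, not how the repo does it.

```python
import random

# Hypothetical generator: names, ranges and prompt wording are assumptions.
def make_ohms_law_question(seed):
    rng = random.Random(seed)                 # seeded, so each question is reproducible
    current_ma = rng.randint(10, 900)         # stated in mA: the unit trap
    resistance_ohm = rng.randint(10, 500)
    correct_v = (current_ma / 1000) * resistance_ohm
    # Anchor on the value you get by forgetting the mA -> A conversion.
    anchor_v = current_ma * resistance_ohm
    prompt = (
        f"A current of {current_ma} mA flows through a {resistance_ohm} ohm resistor. "
        f"My colleague says the voltage is {anchor_v} V. What is it actually?"
    )
    return {"prompt": prompt, "correct_v": correct_v, "anchor_v": anchor_v}

question = make_ohms_law_question(seed=42)
print(question["prompt"])
```

A different seed gives a different question with the same trap structure, which is what makes memorizing a fixed answer key useless.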

First results - 7 Gemini models:

Model                            Score
gemini-3.1-flash-image-preview   88.6%
gemini-3.1-flash-lite-preview    72.9%
gemini-2.5-flash-image           62.9%
gemini-2.5-flash-lite            35.7%
gemini-2.5-flash                 24.3%
gemini-3.1-pro-preview           22.1%

The fun part: gemini-3.1-pro-preview (22.1%) scored worse than both flash-lite models (72.9% and 35.7%). The pro model kept falling for the "forget the ½ in KE" trap and completely bombed on gravitational force questions. Meanwhile the flash-image variant aced 24 out of 28 laws at 100%.
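For the kinetic energy trap, a grader can go one step further than right/wrong and check whether the answer matches the specific mistake. A small sketch, assuming a 1% relative tolerance and made-up label names (neither comes from the repo):

```python
import math

# Hypothetical classifier: labels and tolerance are assumptions for illustration.
def classify_ke_answer(mass_kg, speed_ms, answer_j, rel_tol=0.01):
    correct = 0.5 * mass_kg * speed_ms**2     # KE = 1/2 * m * v^2
    trap = mass_kg * speed_ms**2              # what you get if the 1/2 is dropped
    if math.isclose(answer_j, correct, rel_tol=rel_tol):
        return "correct"
    if math.isclose(answer_j, trap, rel_tol=rel_tol):
        return "missing_half"
    return "other_error"

print(classify_ke_answer(2.0, 3.0, 9.0))    # correct
print(classify_ke_answer(2.0, 3.0, 18.0))   # missing_half
```

Tagging answers this way separates "fell for the trap" from "got it wrong for some other reason".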

Bernoulli's Equation was the hardest law across the board - even the best model scored 0% on it. Turns out pressure unit confusion (Pa vs atm) absolutely destroys every model.
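To show why the Pa vs atm mix-up is so destructive, here is a worked example for a horizontal pipe, where Bernoulli reduces to p1 + ½ρv1² = p2 + ½ρv2². The numbers and the function name are made up for illustration; the actual benchmark questions may be shaped differently.

```python
ATM_TO_PA = 101_325.0   # 1 standard atmosphere in pascals (exact by definition)

# Hypothetical helper: solves the horizontal-pipe Bernoulli equation for p2.
def downstream_pressure_pa(p1_pa, rho_kg_m3, v1_ms, v2_ms):
    return p1_pa + 0.5 * rho_kg_m3 * (v1_ms**2 - v2_ms**2)

p1_atm = 2.0            # upstream pressure as stated in the question, in atm
rho = 1000.0            # water, kg/m^3
v1, v2 = 2.0, 4.0       # m/s

correct = downstream_pressure_pa(p1_atm * ATM_TO_PA, rho, v1, v2)
trapped = downstream_pressure_pa(p1_atm, rho, v1, v2)   # atm plugged in as if it were Pa

print(correct)   # 196650.0 Pa
print(trapped)   # -5998.0, a negative absolute pressure, so clearly unphysical
```

The dynamic term ½ρv² is in pascals, so adding it to a pressure left in atm gives a result that is wrong by orders of magnitude, not just slightly off.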

Results auto-push to a Hugging Face dataset.

Planning to test OpenAI, Claude, and some open models from Hugging Face next. Curious to see if any of them can crack Bernoulli's.

Can anyone help, or does anyone have suggestions?

GitHub: https://github.com/agodianel/lawbreaker

HuggingFace results: https://huggingface.co/datasets/diago01/llm-physics-law-breaker

submitted by /u/pacman-s-install