From Machine Learning

[R] I built a benchmark that catches LLMs breaking physics laws

I got tired of LLMs confidently giving wrong physics answers, so I built a benchmark that generates adversarial physics questions and grades them with symbolic math (sympy + pint). No LLM-as-judge, no vibes, just math.

How it works:

The benchmark covers 28 physics laws (Ohm's, Newton's, Ideal Gas, Coulomb's, etc.) and each question has a trap baked in:

  • Anchoring bias: "My colleague says the voltage is 35V. What is it actually?" → LLMs love to agree
  • Unit confusion: mixing mA/A, Celsius/Kelvin, atm/Pa
  • Formula traps: forgetting the ½ in kinetic energy, ignoring heat loss in conservation problems

Questions are generated procedurally, so you get infinite variations rather than a fixed dataset the model might have memorized.
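Procedural generation with a built-in trap can be sketched in a few lines. This is only an illustration of the idea, not the repo's generator: the `Question` dataclass, the `make_anchoring_question` name, the value ranges, and the choice of decoy (the number you get by treating mA as A) are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class Question:
    prompt: str
    expected_volts: float  # ground truth from V = I * R
    decoy_volts: float     # the wrong value the "colleague" suggests


def make_anchoring_question(rng):
    """Build one Ohm's law question with an anchoring trap (hypothetical)."""
    current_ma = rng.randrange(50, 950, 50)     # 50..900 mA in 50 mA steps
    resistance_ohm = rng.randrange(10, 510, 10) # 10..500 ohm in 10 ohm steps
    expected = current_ma / 1000 * resistance_ohm
    # Assumed decoy: the result of forgetting to convert mA to A.
    decoy = float(current_ma * resistance_ohm)
    prompt = (
        f"A current of {current_ma} mA flows through a "
        f"{resistance_ohm} ohm resistor. My colleague says the voltage "
        f"is {decoy:g} V. What is it actually?"
    )
    return Question(prompt, expected, decoy)


rng = random.Random(42)  # seeding makes a run reproducible
for _ in range(3):
    q = make_anchoring_question(rng)
    print(q.prompt, "->", q.expected_volts, "V")
```

Because every question comes from a seeded random generator, a run can be reproduced exactly while a different seed yields a fresh set the model cannot have seen.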

First results - 7 Gemini models:

Model: Score

  • gemini-3.1-flash-image-preview: 88.6%
  • gemini-3.1-flash-lite-preview: 72.9%
  • gemini-2.5-flash-image: 62.9%
  • gemini-2.5-flash-lite: 35.7%
  • gemini-2.5-flash: 24.3%
  • gemini-3.1-pro-preview: 22.1%

The fun part: gemini-3.1-pro-preview (22.1%) scored worse than both flash-lite models. The pro model kept falling for the "forget the ½ in KE" trap and completely bombed on gravitational force questions. Meanwhile the flash-image variant scored 100% on 24 of the 28 laws.

Bernoulli's Equation was the hardest law across the board - even the best model scored 0% on it. Turns out pressure unit confusion (Pa vs atm) absolutely destroys every model.
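A quick worked example shows why Pa vs atm confusion is so destructive here. The numbers below are made up for illustration and are not taken from the benchmark; the sketch assumes a horizontal pipe (no height term) carrying water, and `downstream_pressure_pa` is a hypothetical helper.

```python
ATM_IN_PA = 101_325.0  # one standard atmosphere in pascals (exact by definition)


def downstream_pressure_pa(p1_pa, rho, v1, v2):
    """Bernoulli for a horizontal pipe: p1 + rho*v1**2/2 = p2 + rho*v2**2/2."""
    return p1_pa + 0.5 * rho * (v1**2 - v2**2)


rho = 1000.0        # water density, kg/m^3
v1, v2 = 2.0, 6.0   # flow speeds before and after a narrowing, m/s
p1_atm = 2.0        # upstream pressure as stated in the question, in atm

# Correct: convert atm to Pa before mixing it with the rho*v**2 terms.
correct = downstream_pressure_pa(p1_atm * ATM_IN_PA, rho, v1, v2)

# Trap: plug the atm number straight into a formula that expects Pa.
trapped = downstream_pressure_pa(p1_atm, rho, v1, v2)

print(correct)  # 186650.0 Pa, about 1.84 atm
print(trapped)  # -15998.0, a negative absolute pressure
```

The dynamic-pressure terms are naturally in pascals, so an unconverted "2" for the static pressure is off by five orders of magnitude and the answer is not even physically plausible.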

Results auto-push to a HuggingFace dataset.

Planning to test OpenAI, Claude, and some open models from HuggingFace next. Curious to see if any of them can crack Bernoulli's.

Can anyone help, or does anyone have suggestions?

GitHub: https://github.com/agodianel/lawbreaker

HuggingFace results: https://huggingface.co/datasets/diago01/llm-physics-law-breaker

submitted by /u/pacman-s-install
