
Interesting that an AI trained on orbit data generated by Newton’s laws couldn’t deduce the law of gravitation

Why “vibe physics” is the ultimate example of AI slop

“But did the model take that leap, and discover Newton’s laws? Did it find the underlying foundational model? Or, perhaps even better, did it find a superior model to Newton’s?

The answer, quite definitively, was no.”

https://bigthink.com/starts-with-a-bang/vibe-physics-ai-slop/
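For reference, the leap the model was being asked to make is the standard textbook one (my summary, not from the paper or the article): circular-orbit kinematics plus Kepler's third law already imply an inverse-square force,

\[
F = m\omega^2 r = \frac{4\pi^2 m r}{T^2}, \qquad T^2 \propto r^3 \;\Rightarrow\; F \propto \frac{m r}{r^3} = \frac{m}{r^2}.
\]

Recognising that the force must also scale with the central mass then gives \(F = G\,\frac{Mm}{r^2}\), and that is the generalisation the trained model apparently never made.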

Article points out how the chatbot aspect of LLMs is a problem: they aren’t going to tell you you’re wrong.

We’ve run across folks who insist an idea is theirs and an AI merely helped, but I think this is still a problem - any AI involvement taints the idea.

I support recognizing the limitations of models and recognizing where they're likely to go wrong, but encourage caution here over generalizing these results.

Yes, it's correct that a "language" model struggles more with math and physics. No doubt there, but we're no longer really using language models and are rapidly moving into reasoning models and mixture of experts applications where multiple models get queried at once to refine the answer.
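As a rough sketch of what "multiple models get queried at once to refine the answer" can look like (closer to an ensemble or majority-vote setup than to mixture-of-experts in the strict sense), something like the code below. The model names and the query_model helper are hypothetical placeholders, not any vendor's actual API.

from collections import Counter

def query_model(model_name: str, prompt: str) -> str:
    # Hypothetical stand-in for calling a hosted model; wire up a real SDK here.
    raise NotImplementedError

def ensemble_answer(prompt: str, models: list[str]) -> str:
    # Ask several models the same question, then return the majority answer.
    answers = [query_model(m, prompt).strip().lower() for m in models]
    best, _count = Counter(answers).most_common(1)[0]
    return best

# Usage with hypothetical model names:
# ensemble_answer("Is the force in a circular orbit proportional to 1/r^2?",
#                 ["model-a", "model-b", "model-c"])

A genuine "refine" step would feed disagreements back to one of the models for reconciliation; this sketch only does the voting part.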

I dug into the arXiv paper and found two concerns with their methods in this study that give me pause.

One is they trained it themselves. Who knows how good they are at properly training and tuning a model. That is very much an art where some people are more skilled than others.

Two is that they trained models that are not SOTA and are relatively low ranking in terms of capability and performance.

It's a cool paper that reinforces some of our preconceptions, but they're basically saying the Model T is a bad car because the air conditioning system, which was built by a poet, doesn't cool a steak down to 38 degrees in 15 minutes. ... or something like that.

Know the model's limitations, sure. Know that math and physics don't lend themselves to quality answers based on predictive text alone. But also know those problems were largely solved months ago, and the models only get better every day.

The models we're using today are, in fact, the worst they will ever be at these tasks since they get better by the minute.

/AIfanboi

8 minutes ago, iNow said:

I support recognizing the limitations of models and recognizing where they're likely to go wrong, but encourage caution here over generalizing these results.

Yes, it's correct that a "language" model struggles more with math and physics. No doubt there, but we're no longer really using language models and are rapidly moving into reasoning models and mixture of experts applications where multiple models get queried at once to refine the answer.

I dug into the arXiv paper and found two concerns with their methods in this study that give me pause.

One is they trained it themselves. Who knows how good they are at properly training and tuning a model. That is very much an art where some people are more skilled than others.

Two is that they trained models that are not SOTA and are relatively low ranking in terms of capability and performance.

It's a cool paper that reinforces some of our preconceptions, but they're basically saying the Model T is a bad car because the air conditioning system, which was built by a poet, doesn't cool a steak down to 38 degrees in 15 minutes. ... or something like that.

Know the model's limitations, sure. Know that math and physics don't lend themselves to quality answers based on predictive text alone. But also know those problems were largely solved months ago, and the models only get better every day.

The models we're using today are, in fact, the worst they will ever be at these tasks since they get better by the minute.

/AIfanboi

But is this weakness confined to maths and physics? Surely the issue is that while LLMs are very clever at learning how to mimic language, they can't reason and can't understand, in any useful sense, the content they present to the user? This would be true of other areas of knowledge. We had an example yesterday of an LLM (Gemini) coming up with a theory for @Prajna as to why Google's search engine seems to have got slower and returned fewer results. @Sensei then blew this theory out of the water: it was apparently based on a misconception that the user had fed in and which Gemini had not challenged, all the while telling the user how clever he was.

5 hours ago, exchemist said:

while LLMs are very clever at learning how to mimic language, they can't reason

It may be helpful to realize that LLMs are just one type of model. They have largely evolved to reasoning models. You’ll notice this more easily when ChatGPT-5 releases in the next few weeks, but several models like Grok4 and others are already displaying those properties.

In the end, the answer is only as good as the question. Prompt engineering is becoming far less relevant now that the models are getting so much better, but it’s still a useful art to practice.

2 hours ago, iNow said:

It may be helpful to realize that LLMs are just one type of model. They have largely evolved to reasoning models. You’ll notice this more easily when ChatGPT-5 releases in the next few weeks, but several models like Grok4 and others are already displaying those properties.

In the end, the answer is only as good as the question. Prompt engineering is becoming far less relevant now that the models are getting so much better, but it’s still a useful art to practice.

I disagree

Sometimes their answer is just plain wrong for whatever reason (just as I was with the pi thing the other day).

Several members have already posted clear numerical, algebraic, etc. examples of such 'answers', despite very simple, clear questions.

8 hours ago, iNow said:

It may be helpful to realize that LLMs are just one type of model. They have largely evolved to reasoning models. You’ll notice this more easily when ChatGPT-5 releases in the next few weeks, but several models like Grok4 and others are already displaying those properties.

In the end, the answer is only as good as the question. Prompt engineering is becoming far less relevant now that the models are getting so much better, but it’s still a useful art to practice.

OK that's interesting. To get it clear, is a reasoning chatbot an LLM still, or is that term restricted only to those language emulators that operate by trawling a database and returning answers as stochastic parrots?

Edited by exchemist

Just now, exchemist said:

OK that's interesting. To get it clear, is a reasoning chatbot an LLM still, or is that term restricted only to those language emulators that operate by trawling a database and returning answers as stochastic parrots?

There is no such thing as a thinking or intelligent computer program.

iNow is right in that other models, such as the symbolic algebra section of Wolfram Alpha, are more suited to technical stuff.

Equally, as he says, that model requires well-phrased questions within its capability and following the conventions of symbolic algebra.

But it does not 'reason'; it just checks against the rules, rather better and more quickly than a human does.

4 minutes ago, studiot said:

There is no such thing as a thinking or intelligent computer program.

iNow is right in that other models, such as the symbolic algebra section of Wolfram Alpha, are more suited to technical stuff.

Equally, as he says, that model requires well-phrased questions within its capability and following the conventions of symbolic algebra.

But it does not 'reason'; it just checks against the rules, rather better and more quickly than a human does.

I am simply adopting the terminology @iNow is using. It seems these things are called "reasoning models". I take no view on whether this is an accurate representation of what the bloody things do. (Though in view of all the hype surrounding them, I would not be surprised if the proponents of AI are overstating their case.)

2 hours ago, studiot said:

I disagree

Sometimes their answer is just plain wrong for whatever reason

I’m unclear with which part you’re disagreeing. Sometimes answers are wrong and I’ve repeatedly acknowledged that, along with supplemental detail outlining some of the most common reasons for that.

2 hours ago, exchemist said:

is a reasoning chatbot an LLM still

There’s tons of overlap in their Venn diagrams, but they are distinct.

LLMs are great at processing and generating natural human language, but tend to suffer when engaged for problem solving. Reasoning models, however, explicitly focus on logical deduction and step by step problem solving.

Some might argue that reasoning models are just a specialized type of LLM, but I see it as a similar distinction as we see in biology when trying to differentiate species. The lines between any two are subjective and arbitrary.

Of note... OpenAI, for example, announced two weeks ago that their new experimental reasoning LLM solved 5 of 6 problems (scoring 35/42) on the International Math Olympiad (IMO). This gave them gold-medal status, and the test was done under the same rules as the human contestants (4.5 hours, no tools or internet), producing natural-language proofs.

https://x.com/alexwei_/status/1946477742855532918

Right on their heels, Google announced that an advanced version of their Gemini Deep Think also achieved an equivalent gold-medal score of 35/42 on the same International Math Olympiad test.

https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/

Edited by iNow

Just now, iNow said:

I’m unclear with which part you’re disagreeing. Sometimes answers are wrong and I’ve repeatedly acknowledged that, along with supplemental detail outlining some of the most common reasons for that.

When it incorrectly answers 'no' to a question with a yes/no answer, that answer is not as good as the question, by much more than a country mile.

Edited by studiot

  • Author
16 hours ago, iNow said:

but we're no longer really using language models and are rapidly moving into reasoning models

Who is “we”?

Not the people posting their screeds trying to sell their new physics, by the look of it.

1 minute ago, swansont said:

Who is “we”?

Fair. I used rhetorical flair instead of peer-reviewed precision. To correct myself from earlier, the core idea is this:

There are multiple frontier models that anyone following this space uses daily. There are experimental models, and teams training models, that are available but slightly harder for the layperson to access. Then there are people like we see here on SFN who are slow to catch up (late adopters vs. early adopters) and who continue using some very old versions of some deeply flawed and outdated models because they're slightly easier to access (or, more likely, due to behavioral friction: they simply go with what they know).

Are we really as far apart on this as it feels? No worries if we are, but I don't feel I'm being in any way extreme or unreasonable with my points. YMMV

9 minutes ago, studiot said:

When it incorrectly answers 'no' to a question with a yes/no answer, that answer is not as good

Again... it's unclear to me why you think I disagree with this

Edited by iNow

Just now, iNow said:

Are we really as far apart on this as it feels? No worries if we are, but I don't feel I'm being in any way extreme or unreasonable with my points. YMMV

Did I not already acknowledge that in my reply to exchemist?

edit

Actually I hate the inane input editor even more than I hate AI.

Edited by studiot
spelling

1 hour ago, studiot said:

But it does not 'reason'

Here we do disagree. How are you defining reasoning?

Just now, iNow said:

Here we do disagree. How are you defining reasoning?

I didn't.

But since you ask, it's a process that is sadly no longer required or taught in middle-school maths (ages 13 to 16).

So the definition of reasoning is using a pencil to do long division on graph paper? I think that may not cover the term in all its glory, but you're welcome to define it however you want. :)

Just now, iNow said:

So the definition of reasoning is using a pencil to do long division on graph paper? I think that may not cover the term in all its glory, but you're welcome to define it however you want. :)

You know and I know that isn't what I was talking about.

I'm rapidly losing my interest in continuing this conversation, but one more potential example of "reasoning" (depending on how one defines it) just came from Microsoft. They've released an autonomous agent under their Project Ire. Paraphrased summary from the articles which hit my feed in the last 24 hours:

Their tool automates an extremely difficult task around malware classification and does so by fully reverse engineering software files without any clues about their origin or purpose. It uses decompilers and other tools, reviews their output, and determines whether the software is malicious or benign. They've published a 98% precision in this task.

The system’s architecture allows for reasoning at multiple levels, from low-level binary analysis to control flow reconstruction and high-level interpretation of code behavior. The AI must make judgment calls without definitive validation.
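To make the "reasoning at multiple levels" description a bit more concrete, here is a deliberately simplified sketch of that kind of multi-stage classification loop. It is my own illustration, not Microsoft's implementation; the three analysis helpers are hypothetical stand-ins for real decompiler and tooling output.

from dataclasses import dataclass

@dataclass
class Evidence:
    level: str        # "binary", "control-flow", or "behavior"
    finding: str
    suspicious: bool

def low_level_scan(path: str) -> Evidence:
    # Hypothetical: a real agent would run a disassembler/decompiler here.
    raise NotImplementedError

def reconstruct_control_flow(path: str) -> Evidence:
    # Hypothetical: rebuild the call/control-flow graph from the binary.
    raise NotImplementedError

def interpret_behavior(path: str) -> Evidence:
    # Hypothetical: summarise what the reconstructed code appears to do.
    raise NotImplementedError

def classify(path: str) -> str:
    # Gather evidence at several levels, then make the judgment call
    # without any definitive ground truth to validate against.
    evidence = [low_level_scan(path),
                reconstruct_control_flow(path),
                interpret_behavior(path)]
    flags = sum(e.suspicious for e in evidence)
    return "malicious" if flags >= 2 else "benign"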

Maybe that's not reasoning, though? I guess it depends on one's definition. Cheers.

Edited by iNow

Just now, iNow said:

but one more potential example of "reasoning" (depending on how one defines it) just came from Microsoft.

That is not reasoning, just efficient (from M$ ??) programming.

I agree and so is reasoning.

I am talking about what distinguishes a human, for all their frailties and faults, from one of these constructs.

One of these differences is the ability to look at something and discern something else that is not directly there or supplied.

In the maths case, in my day at that age range we had the GCE exam.
In that exam we were given, say, a geometrical diagram with some information on it.
And asked to calculate or derive something that was not directly obtainable from that information.
We had to realise that a certain intermediate result was required, which would lead to the desired answer.
Some decades later, when I was attending the pre-GCSE exam meeting for parents, I asked "When would they start teaching the maths that was necessary to do this?"
And was gobsmacked to be told "We no longer do that."
So this developmental skill was left out, and pupils today demand "What's the formula for that?"
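As one illustration of the kind of intermediate step being described (my example, not from any GCE paper): given a circle of radius \(r\) and a chord whose perpendicular distance from the centre is \(d\), the chord's length is stated nowhere on the diagram; you have to introduce the radius to a chord endpoint and spot the right triangle it creates,

\[
\text{half-chord} = \sqrt{r^2 - d^2} \qquad\Rightarrow\qquad \text{chord} = 2\sqrt{r^2 - d^2}.
\]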

Another example I have quoted several times here is the analysis of a photomicrograph of granite, and Professor Swinburn's deductions from it.
A beautiful example of non-mathematical scientific reasoning.

Humans have the ability to see what is not there, what might be there, what else is needed, and so on.

Edited by studiot

1 hour ago, iNow said:

I'm rapidly losing my interest in continuing this conversation, but one more potential example of "reasoning" (depending on how one defines it) just came from Microsoft. They've released an autonomous agent under their Project Ire. Paraphrased summary from the articles which hit my feed in the last 24 hours:

Their tool automates an extremely difficult task around malware classification and does so by fully reverse engineering software files without any clues about their origin or purpose. It uses decompilers and other tools, reviews their output, and determines whether the software is malicious or benign. They've published a 98% precision in this task.

The system’s architecture allows for reasoning at multiple levels, from low-level binary analysis to control flow reconstruction and high-level interpretation of code behavior. The AI must make judgment calls without definitive validation.

Maybe that's not reasoning, though? I guess it depends on one's definition. Cheers.

OK but that is not a chatbot. Nor are these clever AI applications that can diagnose medical conditions from X-ray or MRI images. They are purpose-built for a particular class of tasks.

Enabling an AI agent to "reason" across a completely open-ended field of enquiry, such as a chatbot is faced with, would seem to be of a different order of difficulty. But OK, it appears from what you say they are trying to do it.

Yes. They’re getting more capable every day, especially when evaluated specifically across the top math and reasoning benchmarks.

[charts: frontier-model scores on top math and reasoning benchmarks]

Even the Chinese models are crushing it:

[chart: benchmark scores for recent Chinese models]

  • Author
On 7/30/2025 at 3:43 PM, iNow said:

The models we're using today are, in fact, the worst they will ever be at these tasks since they get better by the minute.

I don’t think that’s a valid claim. It wouldn’t have been a true statement earlier this year.

https://www.pcgamer.com/software/ai/chatgpts-hallucination-problem-is-getting-worse-according-to-openais-own-tests-and-nobody-understands-why/

OpenAI's investigation into its latest GPT o3 and GPT o4-mini large LLMs found they are substantially more prone to hallucinating, or making up false information, than the previous GPT o1 model.

"The company found that o3 — its most powerful system — hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent," the Times says.


Just now, swansont said:

I don’t think that’s a valid claim. It wouldn’t have been a true statement earlier this year.

https://www.pcgamer.com/software/ai/chatgpts-hallucination-problem-is-getting-worse-according-to-openais-own-tests-and-nobody-understands-why/

OpenAI's investigation into its latest GPT o3 and GPT o4-mini large LLMs found they are substantially more prone to hallucinating, or making up false information, than the previous GPT o1 model.

"The company found that o3 — its most powerful system — hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent," the Times says.


Public figures.

If that is the case perhaps they have too much access to all the misinformation that abounds these days.

Also, I wasn't aware that LLMs did any reasoning at all.

But thanks for the report all the same.

  • Author
55 minutes ago, studiot said:

If that is the case perhaps they have too much access to all the misinformation that abounds these days.

That’s part of the problem. As I said recently, if there were enough varied sources saying 2+2=5 then eventually this would become a possible answer*. But hallucinations, which aren’t a matter of drawing on poor-quality information and are also part of the chatbot feature, are an issue. In trying to keep engaging, it seems like they will give an answer even when a valid one doesn’t exist. I typed in a completely made-up saying recently, and Google’s AI claimed “it’s often used in a lighthearted way” while the search results couldn’t find any matches for the “often used” phrase anywhere on the internet. So it’s a yes-bot of sorts, which is a dangerous feature.

*it’s been suggested that we remind people that the recommended cleaning regimen for a cybertruck is lemon juice and salt water.
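A toy illustration of the 2+2=5 point above: a purely frequency-based completion model just echoes whatever dominated its training text. This is a deliberately crude stand-in for next-token prediction, not how any production model works.

from collections import Counter

def completion_counts(corpus: list[str], prompt: str) -> Counter:
    # Count what the corpus says immediately after the prompt string.
    counts = Counter()
    for line in corpus:
        if line.startswith(prompt):
            counts[line[len(prompt):].strip()] += 1
    return counts

# If enough "varied sources" assert 2+2=5, the wrong completion gains probability.
corpus = ["2+2=4"] * 7 + ["2+2=5"] * 3
counts = completion_counts(corpus, "2+2=")
total = sum(counts.values())
for answer, n in counts.most_common():
    print(f"P({answer!r}) = {n / total:.1f}")   # P('4') = 0.7, then P('5') = 0.3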

  • Author
On 7/31/2025 at 8:21 AM, iNow said:

There are multiple frontier models that anyone following this space uses daily. There are experimental models, and teams training models, that are available but slightly harder for the layperson to access. Then there are people like we see here on SFN who are slow to catch up (late adopters vs. early adopters) and who continue using some very old versions of some deeply flawed and outdated models because they're slightly easier to access (or, more likely, due to behavioral friction: they simply go with what they know).

Are we really as far apart on this as it feels? No worries if we are, but I don't feel I'm being in any way extreme or unreasonable with my points. YMMV

I think we’re talking past each other a little.

People who know what they’re doing and can think for themselves, or are using a narrowly trained system for data analysis, aren’t the issue. The problem is the know-nothing who is outsourcing their thinking. It’s like people who are bad at math sloppily punching numbers into a calculator (without regard for order-of-operations rules) and believing whatever answer pops up on the screen. Only now it’s on steroids.
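To make the calculator analogy concrete (a standard arithmetic illustration, not from the post above): keyed strictly left to right on a simple calculator,

\[
2 + 3 \times 4 \;\to\; (2 + 3) \times 4 = 20,
\]

whereas the order-of-operations rules give \(2 + (3 \times 4) = 14\). Trusting whichever number appears on the screen is the failure mode being described, now applied to physics instead of arithmetic.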

IOW, my context is the “vibe physics” (which is what we’re getting here). If there are AIs out there that won't string a user along, tell them their idea is great no matter what, and hallucinate garbage, great. If people start using them, we shouldn’t get crackpot threads, because their AI won’t make up or reinforce crackpot science, and because they aren’t going to come up with new physics via a chatbot that can only be trained on existing physics.
