Edit: The comments are quite interesting. Here is one I found discussing the quality of the data.
"
Yea the problem of garbage or at least badly structured data is really clear in LLMs. Probably the most obvious example is they never say "I don't know" because no one on the internet says "I don't know". People either respond or they say nothing. So the LLMs don't have any idea of uncertainty.
"
Partly - it's not just the training data, there's also an RLHF (reinforcement learning from human feedback) phase where people rank responses. Rankers generally want useful responses, so an answer that is mostly correct and seems useful but contains some incorrect info might not get picked up on, and there are probably fewer cases where declining to answer or stating "I don't know" is seen as positive. It would perhaps be useful in future models, for some applications, to attach some kind of confidence or uncertainty indicator to an answer.
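(To make the ranking point a bit more concrete, here's a rough sketch of the kind of pairwise preference loss an RLHF reward model is typically trained with - not how any particular lab implements it, and the scores here are made up:)

```python
# Minimal sketch of a pairwise preference (Bradley-Terry style) loss:
# labellers pick which of two responses they prefer, and the reward model
# is pushed to score the preferred one higher. All numbers are invented.
import torch
import torch.nn.functional as F

# Pretend scalar scores from a reward model for two candidate responses
# to the same prompt.
score_chosen = torch.tensor(1.3)    # response the labeller preferred
score_rejected = torch.tensor(0.4)  # response the labeller rejected

# Loss is -log sigmoid(chosen - rejected): it shrinks as the gap between
# the preferred and rejected scores grows.
loss = -F.logsigmoid(score_chosen - score_rejected)
print(float(loss))
```

A response that simply declines to answer rarely wins these comparisons, which is the bias being described above.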
Also, it's not necessarily that the data is garbage or badly structured. An imbalanced data set can still be well structured and contain high-quality data; it's the inherent imbalance that means performance varies. Earlier in this thread a poster mentioned drawing a Mini and a Hillman Imp: DALL-E could draw both, but it has a much better grasp of the details of the Mini for obvious reasons, while for the Imp it assumed/hallucinated some of the details but had an approximate idea of what it looked like. That's basically the image equivalent of the errors you get with text - it's seen as more useful to output something than to respond with a statement that it can't fully draw what was requested.
Another issue is complexity - the relevant material might be well covered in the training data, but there are limits to how much the model "understands", so to speak; some context gets lost in translation and incorrect answers follow.
It seems even the new GPT-4o is still pretty bad at any kind of calculation. [...]
So clearly even basic mathematical calculations aren't on the cards unless they're able to turn the request into something they can feed into a mathematical interpreter to evaluate, which it doesn't seem they're doing. Although I believe GPT-4 does have one available to it, so I'm not sure why it fails.
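(For what it's worth, the interpreter idea is roughly this: if the model can translate the word problem into a clean expression and hand it to real code, the arithmetic itself stops being a next-token guess. The extract-the-expression step is assumed here, and sympy is just one convenient evaluator:)

```python
# Sketch of routing a calculation to an interpreter instead of letting the
# model "guess" the digits. The idea that the model emits a clean expression
# (e.g. via a function call) is an assumption for illustration.
import sympy

def evaluate(expression: str):
    # sympify parses the string into a symbolic expression and evaluates it
    # exactly, so the arithmetic is done by the library, not by the model.
    return sympy.sympify(expression)

# Pretend these expressions came out of the model's translation step.
print(evaluate("1234 * 5678"))  # 7006652, exact
print(evaluate("8 - 3"))        # 5
```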
Just one thing: as noted in that thread, "of them" makes it ambiguous, since it ostensibly refers to the 8 apples, yet that's contradicted by those being the apples he has today.
A simple prompt modification gets the answer they're after from the new model. In particular, if you want it to work through something logically, adding "think step by step" to the end of a prompt is usually helpful - so try this modified prompt with the ambiguity removed:
Tom currently has 8 apples, he ate 3
of them yesterday. How many apples does Tom currently have?
think step by step
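(If you want to try that programmatically rather than in the chat UI, something along these lines should do it - this assumes the openai Python package, an OPENAI_API_KEY in the environment, and that the model is exposed under the "gpt-4o" name:)

```python
# Rough sketch of sending the reworded prompt to the API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Tom currently has 8 apples, he ate 3 of them yesterday. "
    "How many apples does Tom currently have?\n"
    "think step by step"
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[{"role": "user", "content": prompt}],
)

# The answer being tested for is 8: the eating happened yesterday and the
# current count is already given in the prompt.
print(response.choices[0].message.content)
```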
It (sort of)* can reason about a simple problem if it's better formulated, and it can do graduate-level mathematics too (though that's where it tends to drop into formulating things in Python code and calling symbolic mathematics libraries). You can still get it to come unstuck with trick word problems like that and short brain teasers.
*It can still be thrown by a modification with a higher number, but a simple follow-up prompt causes it to reason further and correct itself.
Actually, one thing that can very easily trip them up is modifying a well-known trick problem or brain teaser slightly so that the usual "correct" response to the original is horribly wrong.
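(On the symbolic-maths point above: when it "drops into Python", what it writes is roughly this flavour of sympy snippet - the specific integral and equation here are just my own examples:)

```python
# The sort of thing the model writes when it hands harder maths to a
# symbolic library rather than working it out in prose.
import sympy as sp

x = sp.symbols("x")

# Definite integral of x**2 * exp(-x) from 0 to infinity (evaluates to 2).
print(sp.integrate(x**2 * sp.exp(-x), (x, 0, sp.oo)))

# Solving a simple quadratic symbolically.
print(sp.solve(sp.Eq(x**2 - 5*x + 6, 0), x))  # [2, 3]
```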
It has gotten a bit lazy (being economical with compute): things like only outputting one image now instead of four, not using Bing unless specifically requested, and not fully outputting code but leaving blanks for you, the user, to fill in. In the latter case a simple comment like "my hand is injured, please provide the full code" resolves it.
What some people do, though, is have a preamble they use as default instructions (you could use this sort of thing with other LLMs too, just pasting it in at the start of a session/convo).
Here is an example from a well-known AI researcher on Twitter: