Yes, ChatGPT, Claude, and the others ‘hallucinate’. Yes, they ‘lie’. Yes, they’re just spitting out the statistically most relevant sequence of words – from the multi-dimensional region that your own sequence of words happened to activate.
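To make that a bit more concrete, here is a minimal toy sketch in Python. The vocabulary and probabilities are entirely made up, and a real LLM derives its probabilities from the whole context with billions of parameters rather than a lookup table – but the loop is the same in spirit: given the words so far, pick the next word according to a probability distribution, append it, and repeat.

```python
import random

# Toy "model": for the most recent word, a made-up probability
# distribution over possible next words. A real LLM conditions on the
# entire context, not just the last word, but the sampling idea is the same.
NEXT_WORD_PROBS = {
    "the": {"cat": 0.5, "dog": 0.3, "moon": 0.2},
    "cat": {"sat": 0.6, "ran": 0.3, "meowed": 0.1},
    "dog": {"barked": 0.7, "sat": 0.3},
    "sat": {"down": 0.8, "quietly": 0.2},
}

def next_word(context_word: str) -> str:
    """Sample the next word according to the (toy) probabilities."""
    probs = NEXT_WORD_PROBS.get(context_word, {"<end>": 1.0})
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights)[0]

def generate(start: str, max_words: int = 6) -> str:
    """Keep appending sampled words until the toy model runs out."""
    words = [start]
    for _ in range(max_words):
        w = next_word(words[-1])
        if w == "<end>":
            break
        words.append(w)
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat down"
```

Note what is missing from that loop: nothing ever checks whether the sequence is true, only whether it is probable.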
Still, they are more often than not very useful – far beyond merely ‘generating’ text.
They are also, at the same time, familiar and very unfamiliar. We’re chatting with a completely new kind of entity, one that appears human-like. But we can’t really tell when it responds with something incorrect and possibly misleading.
‘ChatGPT can make mistakes. Check important info.’
An LLM might respond entirely correctly because there was plenty of training text about what you prompted it on – and your choice of words successfully activated the right multi-dimensional region – so the statistically relevant sequence of words you got back was something you could rely on.
(It might also be that you asked the LLM about something on which a lot of human effort had been spent in post-training to improve the quality of responses – likely because there were leaderboards ranking LLMs on exactly those kinds of questions.)
But the LLM might respond inaccurately for reasons beyond a lack of training data (or post-training effort). It could have been your choice of words. It could have been that you went on a tangent earlier in the chat session, which steered the LLM into another multi-dimensional region, where another (equally) statistically relevant sequence of words was fed back to you.
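One way to see why the earlier tangent matters: in a chat session, everything said so far (up to the context window) is sent back in as the input the model continues. A rough sketch, with invented message contents, just to show the shape of it:

```python
# A chat session is an accumulating list of messages. The tangent you
# went on is still in there, and it shifts which region of the model's
# "space" the next response is sampled from.
history = [
    {"role": "user", "content": "Summarise the evidence on topic X."},
    {"role": "assistant", "content": "...a sober, factual answer..."},
    # The tangent:
    {"role": "user", "content": "Now write a conspiracy thriller about topic X."},
    {"role": "assistant", "content": "...an atmospheric piece of fiction..."},
    # Back to the original topic, but the fiction is still in the context:
    {"role": "user", "content": "So what should I actually believe about X?"},
]

def build_prompt(messages: list[dict]) -> str:
    """Flatten the session into the text the model actually continues.
    Real chat models use a structured template, but the principle is the
    same: the next words are predicted given *all* of this."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

print(build_prompt(history))
```

The model doesn’t answer the last question in isolation; it continues the whole transcript, fiction included.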
When ChatGPT, Claude, and the others respond inaccurately, you often can’t tell. The LLM itself can’t tell. What it does is return text. It will always do so. And it will always feel as familiar as when it responds accurately.
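You can see this in the shape of the interface itself. A sketch using the OpenAI Python SDK as one example (the model name and the question are placeholders; any chat API looks much the same): what comes back is text, and at best per-token probabilities. There is no field that says whether it is correct.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Who wrote the Voynich manuscript?"}],
    logprobs=True,        # ask for as much "self-knowledge" as the API offers
)

choice = resp.choices[0]
print(choice.message.content)  # always text, and always confidently phrased

# The richest signal available is how *likely* each token was,
# not how *true* it is. There is no is_correct field anywhere.
for tok in choice.logprobs.content[:5]:
    print(tok.token, tok.logprob)
```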
Have LLMs gotten better in this regard? Yes. Will they continue to get better? I would think so. But even with GPT-5 – OpenAI’s newest frontier model – this still happens. Will we ever get models whose responses we can trust all of the time? We’ll see.