LLMs are solving the MCAT, the bar exam, the SAT, etc. like they’re nothing. At this point their performance is superhuman. However, they’ll often trip on super simple common-sense questions, and they’ll struggle with creative thinking.
Is this literally proof that standard tests are not a good measure of intelligence?
Citation needed that LLMs are passing these tests like they’re nothing.
LLMs don’t have intelligence, they are sentence generators. Sometimes those sentences are correct, sometimes they’re gobbledygook.
For instance, they fabricate real-looking but nevertheless totally fake citations in research papers https://www.nature.com/articles/s41598-023-41032-5
To your point, we already know standardized tests are biased and poor tools to measure intelligence. Partly that’s because they don’t actually measure intelligence; they often measure rote knowledge. We don’t need LLMs to make that determination, we already can.
Talked about this a few times over the last few weeks but here we go again…
I am teaching myself to write and had been using ChatGPT for super basic grammar assistance. Seemed like an ideal thing: toss a sentence I was iffy about into it and ask it what it thought. After all, I wasn’t going to be asking it some college level shit. A few days ago I asked it about something I was unsure about. I honestly can’t remember the details, but it completely ignored the part of the sentence I wasn’t sure about and told me something else was wrong. What it said was wrong was just…not wrong. The ‘correction’ it gave me was some shit a third grader would look at and say ‘uhhhhh…I’m gonna ask someone else now…’
That’s because LLMs aren’t intelligent. They’re just parrots that repeat what they’ve heard before. This stuff being sold as an “AI” with any “intelligence” is extremely misleading and causing people to think it’s going to be able to do things it can’t.
Case in point, you were using it and trusting it until it became very obvious it was wrong. How many people never get to that point? How much has it done wrong before then? Etc.
OP picked standardized tests that only require memorization because they have zero idea what a real IQ test like the WAIS is like.
Also, consider how those IQ tests work. You kind of have to go in “blind” to get an accurate result. And an LLM can’t do anything “blind” because you have to train it.
A chatbot can’t even take a real IQ test. If we trained a chatbot to take a real IQ test, it would be a pointless test.
Actually, you can give chatbots a real IQ test, and the range of scores falls into roughly the same spread as how they rank on other measures, with the leading model scoring 100:
https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-100-iq
Nobody is a blank slate. Everyone has knowledge from their past experience, and instincts from their genetics. AIs are the same. They are trained on various things just as humans have experienced various things, but they can be just as blind as each other on the contents of the test.
No, they wouldn’t.
Because real IQ tests aren’t just multiple choice exams.
You would have to train it to handle the different tasks, and training it on the tasks would make it better at the tasks, raising its score.
I don’t know if the issue is that you don’t know how IQ tests work, or what LLMs can do.
But it’s probably both instead of one or the other.
You’re entirely missing the point.
The whole basis of IQ tests is that they consist of problems you haven’t seen before. An LLM works by recognizing existing data and returning what came next in the training set.
LLMs work in direct opposition to how an IQ test works.
Things like past experience are all the shit IQ tests need to avoid in order to be accurate. And they’re exactly what LLMs work off of.
By definition, LLMs have no IQ.
Standard tests don’t measure intelligence. They measure things like knowledge and skill. And ChatGPT is very knowledgeable and highly skilled.
IQ tests have the goal of measuring intelligence.
The range of LLM scores on IQ tests:
https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-100-iq
Yep, very knowledgeable, highly skilled, kind of a dumbass.
You meet enough doctors and lawyers and you tend to find that combination unremarkable.
All a standardized test measures is how well you prepared for that particular standardized test; it doesn’t matter if it is the SAT, MCAT, or Leetcode. You aren’t supposed to think on the spot for these tests; you are supposed to regurgitate everything you have rehearsed for weeks and months.
And unthinking regurgitation is what LLMs do better than anything else.
As someone that didn’t really have good coaching on the SAT, I 100% agree. I kinda fucked it up, but at 17, I wasn’t really used to studying for things outside of school and my parents didn’t get me into any study classes
For the GRE though, I studied my ass off… got 96th percentile scores.
Also went through the leetcode grind. Bombed the first job search I ever did and then later aced the hell out of it after studying really hard.
These tests are all about how diligently you studied and your study technique.
There has been plenty of proof that standardized testing doesn’t work since long before ChatGPT ever existed. Institutions will keep using them though, because that’s what they’ve always done and change is hard.
Not disagreeing with you; how do you suggest a way for admissions to reliably compare applicants with each other? A 3.5 at one school can mean something completely different than a 3.5 at another school.
Something like the SAT is far from perfect, but it is at least one number that means the same thing across applicants.
I think this is the point, because Harvard got rid of the SAT requirement, and then just brought it back.
It’s a really terrible measure.
But it is an equal measure, even though what it measures is moderately meaningless.
I don’t think we have a better answer yet, because everything else lacks any sort of comparable equivalency.
And I say this as an ADHD sufferer who is at a huge disadvantage on standardised testing
When I was at uni, lecturers would often state that exams were the worst measure of grasping the subject material, but it’s all we have at the moment.
I saw this myself with some of my classmates testing very well, but when discussing or problem solving outside of the class there was nothing there.
I think LLMs fall into this category but with way better recall.
When I was at uni, lecturers would often state that exams were the worst measure of grasping the subject material, but it’s all we have at the moment.
It’s not all we have…
But it’s the only way a professor can run multiple classes of 100 students each.
But colleges are all about profit, so class sizes are going to be huge.
The goal isn’t educating people, it’s making money.
So when they say “there’s no other option” they’re not mentioning the “and keep making as much money” at the end, it’s just implied.
I’m not in the US. Colleges here are generally vocational, and both colleges and universities are less (though not totally) concerned with the money side.
For example, where I live, university courses are free for those in the country; those from outside pay fees.
Dunno how it’s done elsewhere, but our courses are usually assessed in 3 parts: 1) exam, 2) practical, 3) essay/investigation. Everyone hates exams.
It’s also the only way that is portable. A professor could evaluate each student, but has no way to transmit that kind of evaluation in a way that schools or employers across the country would trust. They don’t know who the professor is, or what his standards are, or even if he is being bribed to pass somebody. (Which would happen much more if the professor’s opinion had the weight that the standardized test does.)
I had a lot of professors who put most of the grade weight on large projects. It made for a very heavy workload, but projects/papers give a much better picture of how capable someone is of not only reciting knowledge, but also applying it.
Most of my grades were split 40/40/20,
with the third being a written component.
Standardized tests were always a poor measure of comprehensive intelligence.
But this idea that “LLMs aren’t intelligent” popular on Lemmy is based on what seems to be a misinformed understanding of LLMs.
At this point there have been multiple replications of the findings that transformers build world models abstracted from the training data and aren’t just relying on surface statistics.
The free version of ChatGPT (what I’m guessing most people have direct experience with) is several-year-old tech that is (and always has been) pretty dumb. But something like Claude 3 Opus is very advanced at critical thinking compared to GPT-3.5.
A lot of word problem examples that models ‘fail’ are evaluating the wrong thing. When you give an LLM a variation of a classic word problem, the frequency of the normal form biases the answer back towards it unless you take measures to break the token similarities. If you do that though, most modern models actually do get the variation completely correct.
So for example, if you ask it to get a vegetarian wolf, a carnivorous goat, and a cabbage across a river, even asking with standard prompt techniques it will mess up. But if you ask it to get a vegetarian 🐺, a carnivorous 🐐 and a 🥬 across, it will get it correct.
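A minimal sketch of that substitution trick in Python, if anyone wants to try it. The helper name `break_token_similarity` and the mapping are made up for illustration, and it only rewrites the prompt text; sending it to a model is left out.

```python
import re

# Hypothetical mapping, taken from the river-crossing example above.
EMOJI_SUBSTITUTIONS = {"wolf": "🐺", "goat": "🐐", "cabbage": "🥬"}


def break_token_similarity(prompt, substitutions=EMOJI_SUBSTITUTIONS):
    """Replace whole-word matches of each key with its emoji (case-insensitive)."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in substitutions) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: substitutions[m.group(0).lower()], prompt)


prompt = "Get a vegetarian wolf, a carnivorous goat, and a cabbage across a river."
print(break_token_similarity(prompt))
```

The rest of the prompt stays untouched, so the only thing that changes is how closely the wording matches the classic version of the puzzle.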
GPT-3.5 will always fail it, but GPT-4 and more advanced will get it correct. And recently I’ve started seeing models get it correct even without the variation and trip up less with variations.
The field is moving rapidly and much of what was true about LLMs a few years ago with GPT-3 is no longer true with modern models.
I don’t know… I’ve been using ChatGPT4. I use it only where the knowledge it outputs is not important. It’s good when I need help with language related things, as more of a writing assistant. Creative stuff is also OK, sometimes even impressive.
With facts? On moderately complicated topics? I’d say it gets something subtly wrong about 80% of the time, and very obviously wrong 20%. The latter isn’t the problem.
I don’t understand where the “intelligent” part would even come in. Sure, it requires a fair level of intelligence to understand and generate human language responses. But, to me, all I’ve seen fits: generate responses that seem plausible as responses to the input.
If intelligence requires some deeper understanding of the world, and the facts and relationships between them, then I don’t see it. It’s just a coincidence when it looks like it happened. It’s impressive how often that is, but it’s still all it is.
I think it highlights how a lot of these exams are just about the amount of information one can memorize.
Everyone knew this.
Obviously 1:1 mentoring with optional cohorts/custom grouping and experiential, self-paced learning with custom-versioned assignments is best, but that’s simply not practical for a massive system.
Ask an LLM to explain a joke. It often won’t understand why a joke is funny, but that won’t stop it from trying to give you an explanation.
Those tests are not for intelligence. They’re testing whether you’ve done the pre-requisite work and acquired the skills necessary to continue advancing towards your desired career.
Wouldn’t want a lawyer that didn’t know anything about how the law works, after all, maybe they just cheated through their classes or something.
We use standardized tests because they’re cheap pieces of paper we can print out by the thousands and give out to a schoolful of children to get an approximation of their relative intelligence among a limited range of types of intelligence. If we wanted an actual reliable measure of each kid’s intelligence type, they’d get one-on-one attention and go through a range of tests, but that would cost too much (in time & money), so we just approximate with the cheap paper thing instead. Probably we could develop better tests that accounted for more kinds of intelligence, but I’m guessing those other types of intelligence aren’t as useful to capitalism, so we ignore them.
I don’t think any of those tests ever claimed to be a general intelligence test, they’re specific knowledge tests. Books also contain a ton of specific knowledge but books are not intelligent.
Tests built for humans are not tests built for machines.
We already knew that intelligence is a complex and multifaceted property, and that being very intelligent and being very good at taking tests are two distinct (albeit loosely correlated) skills. Testing is just too convenient a measurement, despite its many flaws.
This is only news if you’re an ignorant techbro who doesn’t pay attention to any other field except computer programming.
Intelligence cannot be measured. Treating it as measurable is a reification fallacy. Intelligence is colloquial and subjective.
If I told you that I had an instrument that could objectively measure beauty, you’d see the problem right away.
But intelligence is the capacity to solve problems. If you can solve problems quickly, you are by definition intelligent.
the ability to apply knowledge to manipulate one’s environment or to think abstractly as measured by objective criteria (such as tests)
https://www.merriam-webster.com/dictionary/intelligence
It can be measured by objective tests. It’s not subjective like beauty or humor.
The problem with AI doing these tests is that it has seen and memorized all the previous questions and answers. Many of the tests mentioned are not tests of reasoning, but recall: the bar exam, for example.
If any random person studied every previous question and answer, they would do well too. No one would be amazed that an answer key knew all the answers.
But intelligence is the capacity to solve problems. If you can solve problems quickly, you are by definition intelligent
To solve any problems? Because when I run a computer simulation from a random initial state, that’s technically the computer solving a problem it’s never seen before, and it is trillions of times faster than me. Does that mean the computer is trillions of times more intelligent than me?
the ability to apply knowledge to manipulate one’s environment or to think abstractly as measured by objective criteria (such as tests)
If we built a true super-genius AI but never let it leave a small container, is it not intelligent because WE never let it manipulate its environment? And regarding the tests in the Merriam Webster definition, I suspect it’s talking about “IQ tests”, which in practice are known to be at least partially not objective. Just as an example, it’s known that you can study for and improve your score on an IQ test. How does studying for a test increase your “ability to apply knowledge”? I can think of some potential pathways, but we’re basically back to it not being clearly defined.
In essence, what I’m trying to say is that even though we can write down some definition for “intelligence”, it’s still not a concept that even humans have a fantastic understanding of, even for other humans. When we try to think of types of non-human intelligence, our current models for intelligence fall apart even more. Not that I think current LLMs are actually “intelligent” by however you would define the term.
Does that mean the computer is trillions of times more intelligent than me?
And in addition, is an encyclopedia intelligent because it holds many answers?
This isn’t quite correct. There is the possibility of biasing the results with the training data, but models are performing well at things they haven’t seen before.
For example, this guy took an IQ test, rewrote the visual questions as natural language questions, and gave the test to various LLMs:
https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-100-iq
These are questions with specific wording that the models won’t have been trained on given he wrote them out fresh. Old models have IQ results that are very poor, but the SotA model right now scores a 100.
Engaging with the free version of ChatGPT and thinking “LLMs are dumb” is kind of like talking to a moron and thinking “humans are dumb.” Yes, the free version of ChatGPT has around a 60 IQ on that test, but it also doesn’t represent the cream of the crop.
Maybe, but this is giving the AI a lot of help. No one rewrites visual questions for humans who take IQ tests. That spatial reasoning is part of the test.
In reality, no AI would pass any test because the first part is writing your name on the paper. Just doing that is beyond most AIs because they literally don’t have to deal with the real world. They don’t actually understand anything.
They don’t actually understand anything.
This isn’t correct and has been shown not to be correct in research over and over and over in the past year.
The investigation reveals that Othello-GPT encapsulates a linear representation of opposing pieces, a factor that causally steers its decision-making process. This paper further elucidates the interplay between the linear world representation and causal decision-making, and their dependence on layer depth and model complexity.
https://arxiv.org/abs/2310.07582
Sizeable differences exist among model capabilities that are not captured by their ranking on popular LLM leaderboards (“cramming for the leaderboard”). Furthermore, simple probability calculations indicate that GPT-4’s reasonable performance on k=5 is suggestive of going beyond “stochastic parrot” behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training.
We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process where LLMs select multiple atomic reasoning modules such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2’s performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT).
Just a few of the relevant papers you might want to check out before stating things as facts.
This is a semantic argument.
Have you never felt smarter or dumber depending on the situation? If so, did your ability to think abstractly, apply knowledge, or manipulate your environment change? Intelligence is subjective (and colloquial) like beauty and humor.