‘Impossible’ to create AI tools like ChatGPT without copyrighted material, OpenAI says: Pressure grows on artificial intelligence firms over the content used to train their products
OK, so pay for it.
Pretty simple really.
Every work is protected by copyright, unless stated otherwise by the author.
If you want to create a capable system, you want real data and you want a wide range of it, including data that is rarely considered to be a protected work, despite being one.
I can guarantee you that you’re going to have a pretty hard time finding a dataset with diverse data containing things like napkin doodles or bathroom stall writing that’s compiled with permission of every copyright holder involved.
How hard it is doesn’t matter. If you can’t compensate people for using their work, or exclude work people don’t want used, you just don’t get that data.
There’s plenty of stuff in the public domain.
And artists are being compensated now fairly?
Previous wrongs don’t make this instance right.
now
I never said it was going to be easy - and clearly that is why OpenAI didn’t bother.
If they want to advocate for changes to copyright law then I’m all ears, but let’s not pretend they actually have any interest in that.
I can guarantee you that you’re going to have a pretty hard time finding a dataset with diverse data containing things like napkin doodles or bathroom stall writing that’s compiled with permission of every copyright holder involved.
You make this sound like a bad thing.
And why is that a bad thing?
Why are you entitled to other peoples work, just because “it’s hard to find data”?
Why are you entitled to other peoples work?
Do you really think you’ve never consumed data that was not intended for you? Never used copyrighted works or their elements in your own works?
Re-purposing other people’s work is literally what humanity has been doing for far longer than the term “license” existed.
If the original inventor of the fire drill didn’t want others to use it and barred them from creating a fire bow, arguing it’s “plagiarism” and “a tool that’s intended to replace me”, we wouldn’t have a civilization.
If artists could bar other artists from creating music or art based on theirs, we wouldn’t have such a thing as “genres”. There are genres of music that are almost entirely based around sampling, and many, many popular samples were never explicitly allowed or licensed to anyone. Listen to the hundred most popular tracks of the last 50 years, and I guarantee you, a dozen or more will contain the Amen break, for example.
Whatever you do with data, whether you consume and use it yourself or train a machine learning model on it, you’re either disregarding a large number of copyright restrictions and using all of it, or existing in an informational vacuum.
People do not consume and process data the same way an AI model does. So how humans learn doesn’t matter, because AIs don’t learn. This isn’t repurposing work; it’s using work in a way the copyright holder doesn’t allow, just like copyright holders are allowed to prohibit commercial use.
It’s called “machine learning”, not “AI”, and it’s called that for a reason.
“AI” models are, essentially, solvers for mathematical systems that we, humans, cannot describe and create solvers for ourselves, due to their complexity.
For example, a calculator for pure numbers is a pretty simple device all the logic of which can be designed by a human directly. For the device to be useful, however, the creator will have to analyze mathematical works of other people (to figure out how math works to begin with) and to test their creation against them. That is, they’d run formulas derived and solved by other people to verify that the results are correct.
With “AI”, instead of designing all the logic manually, we create a system which can end up in a finite, yet still near-infinite, number of states, each of which defines behavior different from the others. By slowly tuning the model using existing data and checking its performance, we (ideally) end up with a solver for some incredibly complex system, such as language or images.
If we were training a regular calculator this way, we might feed it things like “2+2=4”, “3x3=9”, “10/5=2”, etc.
If, after we’re done, the model can only solve those three expressions - we have failed. The model didn’t learn the mathematical system, it just memorized the examples. That’s called overfitting and that’s what every single “AI” company in the world is trying to avoid. (And to do so, they need a lot of diverse data)
Of course, if instead of those expressions the training set consisted of Portrait of Dora Maar, Mona Lisa, and Girl with a Pearl Earring, the model would only generate those three paintings.
However, if the training was successful, we can ask the model to solve 3x10/5+2 - an expression it has never seen before - and it’d give us the correct result: 8. Or, in the case of paintings, if we ask for a “Portrait of Mona Lisa with a Pearl Earring”, it would give us a brand new image that contains elements and styles of the three paintings from the training set merged into a new one.
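The memorization-vs-generalization point above can be sketched in a few lines of Python. This is a toy illustration I’m inventing (it only fits addition, simpler than the mixed-operation examples, and has nothing to do with any real training pipeline): a lookup table that memorizes the training examples stands in for an overfit model, next to a tiny fitted model that handles an input pair it never saw.

```python
# Toy training examples, in the spirit of "2+2=4": (a, b, a+b).
train = [(2, 2, 4), (3, 3, 6), (10, 5, 15), (1, 7, 8)]

# "Overfit" model: a lookup table that just memorizes the examples.
lookup = {(a, b): y for a, b, y in train}

# "Learned" model: fit y = w1*a + w2*b by simple stochastic gradient descent.
w1 = w2 = 0.0
lr = 0.01
for _ in range(2000):
    for a, b, y in train:
        err = (w1 * a + w2 * b) - y
        w1 -= lr * err * a
        w2 -= lr * err * b

# An input pair neither model saw during training:
a, b = 6, 9
print(lookup.get((a, b)))      # None: memorization fails off the training set
print(round(w1 * a + w2 * b))  # 15: the fitted rule generalizes
```

The lookup table scores perfectly on the training set and fails everywhere else, which is exactly the overfitting failure mode described above; the fitted weights converge to roughly 1 each, so the model has captured the rule rather than the examples.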
Of course the architecture of a machine learning model and the architecture of the human brain don’t match, but the things both can do are quite similar. Creating new works based on existing ones is not, by any means, a new invention. Here’s a picture that merges elements of “Fear and Loathing in Las Vegas” and “My Little Pony”, for example.
The major difference is that the skills and knowledge individual humans need to do things like that cannot be transferred or lent to other people. Machine learning models can be. This tech is probably the closest we’ll ever be to sharing skills and knowledge “telepathically”, so to say.
I’m well aware of how machine learning works. I did 90% of the work for a degree in exactly it. I’ve written semi-basic neural networks from scratch, and am familiar with terminology around training and how the process works.
Humans learn, process, and, most importantly, transform data in a different manner than machines. The sum total of the existence each individual goes through means there is a transformation based on that existence that can’t be replicated by machines.
A human can replicate other styles, as you show with your example, but that doesn’t mean that is the total extent of new creation. It’s been proven in many cases that civilizations create art in isolation, not needing to draw from any previous art to create new ideas. That’s the human element that can’t be replicated in anything less than true General AI with real intelligence.
Machine learning models such as the LLMs/generative AI of today are statistically based on what they have seen before. While they don’t store the data, they do often replicate it in their outputs. That shows that the models that exist now are not creating new ideas, but rather mixing up what they already have.
If it ends up being OK for a company like OpenAI to commit copyright infringement to train their AI models it should be OK for John/Jane Doe to pirate software for private use.
But that would never happen. Almost like the whole of copyright has been perverted into a scam.
You wouldn’t steal a car, would you?
Using copyrighted material is not the same thing as copyright infringement. You need to (re)publish it for it to become an infringement, and OpenAI is not publishing the material made with their tool; the users of it are. There may be some grey areas for the law to clarify, but as yet, they have not clearly infringed anything, any more than a human reading copyrighted material and making a derivative work.
It comes from OpenAI and is given to OpenAI’s users, so they are publishing it.
It’s being mishmashed with a billion other documents just like it to make a derivative work. It’s not like OpenAI is giving you a copy of Hitchhiker’s Guide to the Galaxy.
The New York Times was able to have it return a complete NYT article, verbatim. That’s not derivative.
I thought the same thing until I read another perspective on it from Mike Masnick, and, from what he writes, it seems pretty clear they manipulated ChatGPT with some very specific prompts that someone who doesn’t already pay NYT for access would not be able to use. For example, feeding it 3 verbatim paragraphs from an article and asking it to generate the rest. If you understand how these LLMs work, it’s really not surprising that you can indeed force it to do things like that, but it’s an extreme case, and I’m with Masnick and the user you’re responding to on this one myself.
I also watched most of today’s subcommittee hearing on AI and journalism. A lot of the arguments are that this will destroy local journalism. Look, strong local journalism is some of the most important work that is dying right now. But the grave was dug by these large media companies and hedge funds that bought up and gutted those local news orgs, and not many people outside of the industry batted an eye while that was happening. This is a bit of a tangent, but I don’t exactly trust the giant hedge funds who gutted these local news journalists over the past decade to all of a sudden care at all about how important they are.
Sorry for the tangent, but here’s the article I mentioned that’s more on topic: http://mediagazer.com/231228/p11#a231228p11
So they gave it the 3 paragraphs that are available publicly, said continue, and it spat out the rest of the article that’s behind a paywall. That sure sounds like copyright infringement.
And that’s not the intent of the service, it’s a bug and they’ll fix it.
Insane how this comment is downvoted, when, as far as I’m aware, it’s literally just the legal reality at this point in time.
any more than a human reading copyrighted material and making a derivative work.
It seems obvious to me that it’s not doing anything different than a human does when we absorb information and make our own works. I don’t understand why practically nobody understands this
I’m surprised to have even found one person that agrees with me
Because it’s objectively not true. Humans and ML models fundamentally process information differently and cannot be compared. A model doesn’t “read a book” or “absorb information”
I didn’t say they processed information the same, I said generative AI isn’t doing anything that humans don’t already do. If I make a drawing of Gordon Freeman or Courage the Cowardly Dog, or even a drawing of Gordon Freeman in the style of Courage the Cowardly Dog, I’m not infringing on the copyright of Valve or John Dilworth. (Unless I monetize it, but even then there’s fair-use…)
Or if I read a statistic or some kind of piece of information in an article and spoke about it online, I’m not infringing the copyright of the author. Or if I listen to hundreds of hours of a podcast and then do a really good impression of one of the hosts online, I’m not infringing on that person’s copyright or stealing their voice.
Neither me making that drawing, nor relaying that information, nor doing that impression are copyright infringement. Me uploading a copy of Courage or Half-Life to the internet would be, or copying that article, or uploading the hypothetical podcast on my own account somewhere. Generative AI doesn’t publish anything, and even if it did I think there would be a strong case for fair-use for the same reasons humans would have a strong case for fair-use for publishing their derivative works.
I guess the lesson here is pirate everything under the sun and as long as you establish a company and train a bot everything is a-ok. I wish we knew this when everyone was getting dinged for torrenting The Hurt Locker back when.
Remember when the RIAA got caught with pirated mp3s and nothing happened?
What a stupid timeline.
It’s almost like we had a place where copyrighted works used to end up (the public domain), but they extended the dates because money.
This is where they have the leverage to push for actual copyright reform, but they won’t. Far more profitable to keep the system broken for everyone but have an exemption for AI megacorps.
I was literally about to come in here and say it would be an interesting tangential conversation to talk about how FUCKED copyright laws are, and how relevant to the discussion it would be.
More upvote for you!
If the rule is stupid or evil we should applaud people who break it.
we should use those who break it as a beacon to rally around and change the stupid rule
Except they pocket millions of dollars by breaking that rule, and the original creators of their “essential data” don’t get a single cent while their creations indirectly show up in content generated by AI. If it really was about changing the rules, they wouldn’t be so obvious about making it profitable, but would rather use that money to make it available for the greater good AND pay the people that made their training data. Right now they’re hell-bent on commercialising their products as fast as possible.
If their statement is that stealing literally all the content on the internet is the only way to make AI work (instead of for example using their profits to pay for a selection of all that data and only using that) then the business model is wrong and illegal. It’s as a simple as that.
I don’t get why people are so hell-bent on defending OpenAI in this case; if I were to launch a food-delivery service that’s affordable for everyone, but I shoplifted all my ingredients “because it’s the only way”, most would agree that’s wrong and my business is illegal. Why is this OpenAI case any different? Because AI is an essential development? Oh, and affordable food isn’t?
I am not defending OpenAi I am attacking copyright. Do you have freedom of speech if you have nothing to say? Do you have it if you are a total asshole? Do you have it if you are the nicest human who ever lived? Do you have it and have no desire to use it?
Wow! You’re telling me that onerous and crony copyright laws stifle innovation and creativity? Thanks for solving the mystery guys, we never knew that!
innovation and creativity
Neither of which is being stifled here. OpenAI didn’t write ChatGPT with copyrighted code.
What’s being “stifled” is corporate harvesting and profiting off the works of individuals, at their expense. And damn right it should be.
at their expense
How?
‘Data poisoning’, encryption, & copyright.
Please show me the poor artist whose work was stolen. I want a name.
If there is no victim there is no crime.
finally capitalism will notice how many times it has shot up its own foot with its ridiculous, greedy infinite-copyright scheme
As a musician, people not involved in the making of my music make all my money nowadays instead of me anyway. burn it all down
Pitchfork fest 2024
… that’s a good album name, might use that ;)
it would sell
if it’s impossible for you to have something without breaking the law you have to do without it
if it’s impossible for the aristocrat class to have something without breaking the law, we change or ignore the law
I’m dumbfounded that any Lemmy user supports OpenAI in this.
We’re mostly refugees from Reddit, right?
Reddit invited us to make stuff and share it with our peers, and that was great. Some posts were just links to the content’s real home: Youtube, a random Wordpress blog, a Github project, or whatever. The post text, the comments, and the replies only lived on Reddit. That wasn’t a huge problem, because that’s the part that was specific to Reddit. And besides, there were plenty of third-party apps to interact with those bits of content however you wanted to.
But as Reddit started to dominate Google search results, it displaced results that might have linked to the “real home” of that content. And Reddit realized a tremendous opportunity: They now had a chokehold on not just user comments and text posts, but anything that people dare to promote online.
At the same time, Reddit slowly moved from a place where something may get posted by the author of the original thing to a place where you’ll only see the post if it came from a high-karma user or bot. Mutated or distorted copies of the original, reformatted to cut through the noise and gain the favor of the algorithm. Re-posts of re-posts, with no reference back to the original, divorced of whatever context or commentary the original creator may have provided. No way for the audience to respond to the author in any meaningful way and start a dialogue.
This is a miniature preview of the future brought to you by LLM vendors. A monetized portal to a dead internet. A one-way street. An incestuous ouroboros of re-posts of re-posts. Automated remixes of automated remixes.
–
There are genuine problems with copyright law. Don’t get me wrong. Perhaps the most glaring problem is the fact that many prominent creators don’t even own the copyright to the stuff they make. It was invented to protect creators, but in practice this “protection” gets assigned to a publisher immediately after the protected work comes into being.
And then that copyright – the very same thing that was intended to protect creators – is used as a weapon against the creator and against their audience. Publishers insert a copyright chokepoint in-between the two, and they squeeze as hard as they desire, wringing it of every drop of profit, keeping creators and audiences far away from each other. Creators can’t speak out of turn. Fans can’t remix their favorite content and share it back to the community.
This is a dysfunctional system. Audiences are denied the ability to access information or participate in culture if they can’t pay for admission. Creators are underpaid, and their creative ambitions are redirected to what’s popular. We end up with an auto-tuned culture – insular, uncritical, and predictable. Creativity reduced to a product.
But.
If the problem is that copyright law has severed the connection between creator and audience in order to set up a toll booth along the way, then we won’t solve it by giving OpenAI a free pass to do the exact same thing at massive scale.
Mutated or distorted copies of the original, reformatted to cut through the noise and gain the favor of the algorithm. Re-posts of re-posts, with no reference back to the original, divorced of whatever context or commentary the original creator may have provided… This is a miniature preview of the future brought to you by LLM vendors. A monetized portal to a dead internet. A one-way street. An incestuous ouroboros of re-posts of re-posts. Automated remixes of automated remixes.
The internet is genuinely already trending this way just from LLM AI writing things like: articles and bot reviews, listicle and ‘review’ websites that laser focus for SEO hits, social media comments and posts to propagandize or astroturf…
We are going to live and die by how the Captcha-AI arms race is run against the malicious actors, but that won’t help when governments or capital give themselves root access.
Too long didn’t read, busy downloading a car now. How much did Disney pay for this comment?
This situation seems analogous to when air travel started to take off (pun intended) and existing legal notions of property rights had to be adjusted. IIRC, a farmer sued an airline for trespassing because they were flying over his land. The court ruled against the farmer because to do otherwise would have killed the airline industry.
I member
And we did so before then with ‘Mineral Rights’. You can drill for oil on your property but If you find it - it ain’t yours because you only own what you can walk on in many places. Capitalists are gonna capitalize
If the copyright people had their way we wouldn’t be able to write a single word without paying them. This whole thing is clearly a fucking money grab. It is not struggling artists being wiped out, it is big corporations suing a well funded startup.
It’s not “impossible”. It’s expensive and will take years to produce material under an encompassing license in the quantity needed to make the model “large”. Their argument is basically “but we can have it quickly if you allow legal shortcuts.”
That argument has unfortunately worked for many other Tech Bros
Whenever a company says something is impossible, they usually mean it’s just unprofitable.
The law is shit
Then LLMs should be FOSS
All AI should be FOSS and public domain, owned by the people, and all gains from its use taxed at 100%. It’s only because of the public that AI exists: the schools, universities, NSF grants, and all the other places taxes have been poured into created the advances upon which AI stands, as well as the AI-critical research.
That does nothing to solve the problem of data being used without consent to train the models. It doesn’t matter if the model is FOSS if it stole all the data it trained on.
The only way I can steal data from you is if I break into your office and walk off with your hard drive. If you still have access to something, it hasn’t been stolen.
If I steal something from you I have it and you don’t. When I copy an idea from you, you still have the idea. As a whole the two person system has more knowledge. While actual theft is zero sum. Downloading a car and stealing a car are not the same thing.
And don’t even try the rewarding-artists-and-inventors argument. Companies that fund R&D get tax breaks for it, so they already get money. And artists are rarely compensated appropriately.
Piracy by another name. Copyrighted materials are being used for profit by companies that have no intention of compensating the copyright holder.