Reddit user content being sold to AI company in $60M/year deal::It’s being reported that a deal has been struck to allow an unnamed large AI company to use Reddit user…

    • andrew_bidlaw@sh.itjust.works
      English
      arrow-up
      17
      ·
      1 year ago

      Probably because it was harvested long before they locked down the API. I suspect it’s not a purchase but a way to legitimize datasets already in the works, since Reddit said they are now trading them. And our favorite CEO struggles to turn a profit, so he hardly had any leverage to ask for more.

    • Grimy@lemmy.world
      English
      arrow-up
      8
      ·
      edit-2
      1 year ago

      It’s mostly data that’s publicly available. It’s more of a gamble, I think: it’s only worth anything if the government decides you need to pay for the data you use in training.

  • Jo Miran@lemmy.ml
    English
    arrow-up
    35
    arrow-down
    3
    ·
    edit-2
    1 year ago

    Remember kids, don’t delete your account. Use scripts to replace all of your posts and comments with nonsense. If there is an option in your script to feed it a “dictionary”, I highly suggest using books from the public domain like “Lady Chatterley’s Lover” by D. H. Lawrence. Replace all images and video links with Steamboat Willie.

    • Grimy@lemmy.world
      English
      arrow-up
      11
      ·
      1 year ago

      They sell all your edits as well. This does make it harder to scrape the data, which inadvertently drives up the value of the data they sell.

      • Jo Miran@lemmy.ml
        English
        arrow-up
        6
        ·
        1 year ago

        Yeah, that’s the idea. Originally I went the “random characters then delete” route but realized that if I used randomized book excerpts from the public domain, the AI, or even a human, would have a very hard time figuring out what was real and what was trash. Ultimately, even if I can’t modify them all, I can modify enough to make it easier for the buyer to just filter my username out in order to keep the results clean.

      • BananaTrifleViolin@lemmy.world
        English
        arrow-up
        2
        ·
        edit-2
        1 year ago

        I do wonder how much backup data a site like Reddit keeps. I suspect their backups are poor, as the main focus is staying live and moving forward.

        I’d imagine the ability to revert a few days, maybe weeks, but not much more than that? Would they see the value in keeping copies of every edit and every deleted post? Would someone building the website even bother to build that functionality?

        Also, for Reddit, so much of their content is based around weblinks, which give the discussions context and meaning. I bet there are an awful lot of dead links on Reddit, and their move to host their own pictures and videos was probably too late. Big hosting sites have disappeared over time, deleted content, or locked down content from AI farming.

        The more I think about it, they were lucky to get $60m/year.

        • T156@lemmy.world
          English
          arrow-up
          1
          ·
          1 year ago

          I’d imagine the ability to revert a few days, maybe weeks, but not much more than that? Would they see the value in keeping copies of every edit and every deleted post? Would someone building the website even bother to build that functionality?

          Maybe not for reversion, but I could see them keeping the edits, since it doesn’t cost them much to do so, and it could be useful for spam identification or legal purposes. For example, if an account posts spam, and then edits their comment to hide it/skirt around moderation, or vice versa.

          They would also have the benefit of the edits inflating the size of the data that they’re selling, which wouldn’t hurt.

    • CosmoNova@lemmy.world
      English
      arrow-up
      1
      ·
      1 year ago

      Generally, what’s the best/most efficient way to make LLMs go off the rails? I mean without just typing lots of gibberish and making it too obvious. As an example: I’ve seen people format their prompts with about two lines of Java code, and the replies instantly went nuts.

      • Jo Miran@lemmy.ml
        English
        arrow-up
        2
        ·
        1 year ago

        I use a few dozen novels in a single text file and randomize which lines the script pulls. It then replaces the text three times with a random pull. What you end up with are four responses in plain English. Which is the real one? You could filter out responses edited after “the great exodus”, but I have been doing this to my comments a few times per year during my twelve years on reddit.

        The truth is that even if I don’t get them all, I get enough that it makes it far easier for the group that bought the data to just filter my username out rather than figure out what’s junk and what isn’t.
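A minimal sketch of the routine described above, assuming a plain-text file of public-domain book lines and a caller-supplied `edit_fn(comment_id, text)` that performs the actual edit. The names `load_lines`, `overwrite_passes`, `scramble_comment`, and `edit_fn` are all hypothetical; no real Reddit API call is shown, and this is not the commenter's actual script.

```python
import random
from typing import Callable, List, Optional


def load_lines(path: str) -> List[str]:
    """Read the book file and keep only non-empty lines."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def overwrite_passes(
    lines: List[str], passes: int = 3, rng: Optional[random.Random] = None
) -> List[str]:
    """Pick one random book line per overwrite pass."""
    if not lines:
        raise ValueError("the dictionary file produced no usable lines")
    if rng is None:
        rng = random.Random()
    return [rng.choice(lines) for _ in range(passes)]


def scramble_comment(
    comment_id: str,
    lines: List[str],
    edit_fn: Callable[[str, str], None],
    passes: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Overwrite one comment `passes` times; edit_fn is a hypothetical hook."""
    replacements = overwrite_passes(lines, passes, rng)
    for text in replacements:
        # Each call leaves one more plausible-looking revision in the history.
        edit_fn(comment_id, text)
    return replacements
```

With three passes, the edit history holds the original text plus three plain-English replacements, which is the "four responses" situation described above.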

  • qooqie@lemmy.world
    English
    arrow-up
    16
    ·
    1 year ago

    Does this include art OC posted there being used to train art bots? If I were posting OC art I’d just delete that shit right away, not that it’ll help I suppose

    • CosmoNova@lemmy.world
      English
      arrow-up
      6
      ·
      1 year ago

      Honestly, I can see the appeal of a model going “fuck spez” unprompted once in a while.

  • C126@sh.itjust.works
    English
    arrow-up
    7
    ·
    1 year ago

    Shower thought: what if a large number of people made lots of posts and comments on reddit using only AI generated content?

    • T156@lemmy.world
      English
      arrow-up
      12
      ·
      edit-2
      1 year ago

      Considering the spam problem, in a way, it sort of is already happening.

      It’s possible that part of the reason for the API changes was to curb that kind of behaviour before people decided to go and do just that, or to stop them using bots to wipe their profiles.

      • Corkyskog@sh.itjust.works
        English
        arrow-up
        4
        ·
        edit-2
        1 year ago

        Honestly, you just need to convince people to go through their comments and break any chains with nonsense. I bet that they are training conversational abilities (I mean, what other good is the data set? It’s not like redditors are experts, or that the experts, when there are any, get upvoted at all).

  • Burn_The_Right@lemmy.world
    English
    arrow-up
    5
    ·
    1 year ago

    This is going to backfire when the content they are selling is used by AI to make bots to make the content that gets sold to make the AI to make bots to make the content.

  • 7heo@lemmy.ml
    English
    arrow-up
    5
    ·
    edit-2
    1 year ago

    The annoying part is that the only use I have for “AI” so far is “translating reddit post titles to understandable English”. Once they train their “AI” on whatever is there, I probably won’t be able to understand the “translation” anymore… Sucks. 😬

  • Grimy@lemmy.world
    English
    arrow-up
    4
    arrow-down
    1
    ·
    1 year ago

    This is why it’s so important that we don’t legislate against AI and make it illegal to use scraped data. All the data is already owned by someone; putting up walls only screws us out of the open source scene.

    • g0nz0li0@lemmy.world
      English
      arrow-up
      5
      arrow-down
      2
      ·
      edit-2
      1 year ago

      And legislate content ownership altogether. The idea that Reddit spent more than a decade growing its community just so that it could use our content as its own property is a huge issue. How do we safely and fairly communicate and express our ideas in a society where the platforms that enable this automatically claim ownership of our ideas? Social media are middlemen with outsized influence.

  • OldWoodFrame@lemm.ee
    English
    arrow-up
    1
    arrow-down
    3
    ·
    edit-2
    1 year ago

    $60 Million or $60,000? Sometimes people use MM for Million and M for ‘Mille’ aka thousand. Other times people use M for Million and k for Thousand. Not a great article if they can’t clarify that.