I’m rather curious to see how the EU’s privacy laws are going to handle this.

(Original article is from Fortune, but Yahoo Finance doesn’t have a paywall)

  • DigitalWebSlinger@lemmy.world
    English
    arrow-up
    103
    ·
    1 year ago

    “AI model unlearning” is the equivalent of saying “removing a specific feature from a compiled binary executable”. So, yeah, basically not feasible.

    But the solution is painfully easy: you remove the data from your training set (i.e., the source code), and re-train your model (recompile the executable).

    Yes, it may cost you a lot of time and money to accomplish this, but such are the consequences of breaking the law. Maybe be extra careful about obeying laws going forward, eh?

    • CoderKat@lemm.ee
      English
      arrow-up
      7
      arrow-down
      1
      ·
      1 year ago

      Retraining the model is incredibly expensive. That basically means not training the model with any user data at all, even if it slips in accidentally, through someone sabotaging the training data, or even with consent (since consent can be revoked).

      • Thann@lemmy.ml
        English
        arrow-up
        12
        arrow-down
        2
        ·
        1 year ago

        Consent can’t be revoked; they’re not even trying to get consent.

        They seemingly all have a “use first, then ask for forgiveness” approach, which should come around to bite them in the ass.

        • Jaded@lemmy.dbzer0.com
          English
          arrow-up
          6
          arrow-down
          1
          ·
          1 year ago

          Anything else is going to bite US in the ass. Asking for consent kills any kind of open source development. It puts AI solely in the hands of like three companies. Our economy is going to be very AI focused in the future, they would literally own all of us.

          You aren’t getting paid either way, so we might as well all enjoy the fruits of humanity’s labor freely instead of being forced into a subscription model of it.

            • Jaded@lemmy.dbzer0.com
              English
              arrow-up
              1
              ·
              1 year ago

              “Most of the data used by large companies isn’t available to the majority of people. We think that stifles innovation.”

              Yes, crowdsourcing is a solution, but it is only really possible if you are able to reach many people like Mozilla can. They only have 20k hours to date. Tortoise needed 50k hours and was made by one guy who open sourced it. He would not have been able to build it without scraping YouTube.

              Crowdsourcing also becomes much more complicated for LLMs or if you are making models in other languages.

          • Fushuan [he/him]@lemm.ee
            English
            arrow-up
            1
            ·
            1 year ago

            Asking for consent doesn’t kill open source development. Consent is the very reason we have licensed code. MIT, Apache, GPL3… And development is done and code is reused in accordance with those licenses.

            • Jaded@lemmy.dbzer0.com
              English
              arrow-up
              1
              ·
              1 year ago

              Making LLMs requires a stupid amount of data, much more than what is found in the Creative Commons. Same goes for image gen. Unless you have been accumulating data since forever by tricking people when they sign up to your website or app, you can’t train anything without scraping most of the data.

              It has nothing to do with licensing; the problem is that there just isn’t enough “free-use” data.

    • Asymptote@lemmy.dbzer0.com
      English
      arrow-up
      5
      arrow-down
      1
      ·
      1 year ago

      “removing a specific feature from a compiled binary executable”

      That’s how patches used to be 😆

    • Ajen@sh.itjust.works
      English
      arrow-up
      6
      arrow-down
      3
      ·
      1 year ago

      removing a specific feature from a compiled binary executable

      That’s actually very feasible. Compiled binaries translate directly to assembly, which is taught to most (all?) comp sci undergrads. When the binary is compiled by a standard compiler the translated assembly is very easy to understand, and for software that has protections/obfuscations like DRM and viruses there are reverse engineering tools like IDA Pro.

    • Fushuan [he/him]@lemm.ee
      English
      arrow-up
      5
      arrow-down
      2
      ·
      1 year ago

      A trained AI model is a set of weights applied to a given neural network. The difference between two models, one trained without the key data and one trained with it, can be computed, and a tool can be created to generate a transformation from model A to model B, or even a good approximation of model B trained with another AI.

      It’s not THAT hard actually.

      • applebusch@lemmy.world
        English
        arrow-up
        6
        ·
        1 year ago

        I don’t doubt that mathematically, but practically that sounds like it would be functionally equivalent to just retraining the model. Like if it were more efficient to just calculate the model weights based on input data, that’s what we would do, there would be no need to go through the training process. We could just start with a completely untrained model and calculate the difference between that model and one that was trained with all the data. The more I think about it the more I doubt that mathematically. The feasibility of this would depend heavily on the details of the model and how it was trained. Lots of times the order in which the data was presented during training has an impact on the final result, so there’s no guarantee your subtraction would achieve the same or even similar result as retraining without the specified data. Maybe you can reference some papers on the topic.
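
To make the ordering point concrete, here is a minimal sketch on a made-up one-weight model (y ≈ w * x) trained one example at a time. The data, the learning rate and the function names are all assumptions for illustration, nothing here comes from a real system. It shows that the same examples in a different order land on different weights, and that the "difference between model A and model B" can only be computed once model B has already been retrained.

```python
# Toy one-weight model y ≈ w * x, trained by per-example gradient descent.
# Everything here (data, learning rate, names) is hypothetical.

def sgd_step(w, x, y, lr=0.1):
    # Gradient of the squared error (w*x - y)**2 / 2 with respect to w
    grad = (w * x - y) * x
    return w - lr * grad

def train(examples, w=0.0, lr=0.1):
    # Apply one update per example, in the order given
    for x, y in examples:
        w = sgd_step(w, x, y, lr)
    return w

examples = [(1.0, 2.0), (2.0, 1.0), (3.0, 6.0)]

# Same data, two presentation orders: the final weights differ.
w_forward = train(examples)
w_reversed = train(list(reversed(examples)))

# Model A saw everything; model B is retrained without examples[1].
w_a = w_forward
w_b = train([examples[0], examples[2]])

# The "transformation" from A to B is trivial to write down,
# but only after w_b exists, i.e. after the retraining was done anyway.
delta = w_b - w_a

print(w_forward, w_reversed, delta)
```

A real network has millions or billions of weights and many passes over the data, so this only illustrates the dependency, not the scale of the problem.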

        • Fushuan [he/him]@lemm.ee
          English
          arrow-up
          3
          ·
          edit-2
          1 year ago

          I have a bachelor’s in computer science specialised in data engineering and data science, with a master’s in data science, and I have worked for some years in computer vision, training and tweaking models.

          Currently specialised in data engineering, but I’d wager I do know about what I’m talking about.

          People who “work with AI” most of the time don’t know shit about how it internally works, so I don’t know if that’s a label I’d even use to give an informed opinion about the matter.

    • Dkarma@lemmy.world
      English
      arrow-up
      2
      ·
      edit-2
      1 year ago

      It takes so much money to retrain models tho… like the entire cost all over again… and what if they find something else?

      Crazy how murky the legalities are here… just no case law to base anything on really.

      For people who don’t know how machine learning works, at a very high level:

      Basically every input the AI is trained on or “sees” changes a set of weights (floating-point numbers). Once the weights are changed, you can’t remove that input and change the weights back to what they were; you can only keep changing them on new input.
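
A minimal sketch of why "changing the weights back" does not work, assuming a made-up one-weight model and hypothetical data. Even if you recorded exactly how much one input moved the weight, subtracting that amount later does not give you the model you would have had without it, because every later update was computed from the already shifted weight.

```python
# Hypothetical one-weight model y ≈ w * x with per-example gradient descent.

def sgd_step(w, x, y, lr=0.1):
    # One gradient step on the squared error for a single example
    grad = (w * x - y) * x
    return w - lr * grad

examples = [(1.0, 2.0), (2.0, 1.0), (3.0, 6.0)]  # made-up inputs

# Train on everything, recording how much each input moved the weight.
w = 0.0
updates = []
for x, y in examples:
    new_w = sgd_step(w, x, y)
    updates.append(new_w - w)
    w = new_w
w_all = w

# Naive "unlearning": subtract the recorded update of the second input.
w_naive = w_all - updates[1]

# What was actually wanted: the weight after training without that input.
w = 0.0
for i, (x, y) in enumerate(examples):
    if i != 1:
        w = sgd_step(w, x, y)
w_retrained = w

# The two disagree: the third update depended on the weight the
# second input had already changed.
print(w_naive, w_retrained)
```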

      • DigitalWebSlinger@lemmy.world
        English
        arrow-up
        9
        ·
        1 year ago

        So we just let them break the law without penalty because it’s hard and costly to redo the work that already broke the law? Nah, they can put time and money towards safeguards to prevent themselves from breaking the law if they want to try to make money off of this stuff.

        • Dkarma@lemmy.world
          English
          arrow-up
          2
          ·
          1 year ago

          No one has established that they’ve broken the law in any way, though. Authors are upset but it’s unclear if they can prove they were damaged in some way or that the companies in question are even liable for anything.

          Remember, the burden of proof is on the plaintiff, not these companies, if a suit is brought.

    • AWittyUsername@lemmy.world
      English
      arrow-up
      2
      arrow-down
      1
      ·
      1 year ago

      Much like DLLs exist for compiled binary executables, could we not have modular AI training data? Then only a small chunk would need to be relearned at a time.

      Just throwing this into the void here.
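
One way to read the "modular" idea is sharded training: split the data into shards, train a separate small model per shard, and combine their predictions. Forgetting one item then means retraining only the shard that held it. The sketch below is a toy illustration under that assumption, with hypothetical data and function names; it says nothing about whether the combined model would be as good as one trained on everything at once.

```python
# Toy sharded training: one tiny model (y ≈ w * x) per shard of data.
# Data, shard layout and names are hypothetical.

def fit(shard):
    # Closed-form least-squares fit of a single weight for one shard
    num = sum(x * y for x, y in shard)
    den = sum(x * x for x, _ in shard)
    return num / den

def predict(weights, x):
    # Combine the shard models by averaging their predictions
    return sum(w * x for w in weights) / len(weights)

def forget(shards, weights, shard_idx, example):
    # Drop one example and retrain only the shard that contained it
    new_shards = [list(s) for s in shards]
    new_shards[shard_idx].remove(example)
    new_weights = list(weights)
    new_weights[shard_idx] = fit(new_shards[shard_idx])
    return new_shards, new_weights

shards = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(1.0, 1.9), (3.0, 6.2)],
    [(2.0, 3.8), (4.0, 8.0)],
]
weights = [fit(s) for s in shards]

new_shards, new_weights = forget(shards, weights, 1, (3.0, 6.2))

print(predict(weights, 2.0), predict(new_weights, 2.0))
```

The trade-off is that each sub-model only ever sees a fraction of the data, which is a real cost for models that need a lot of it.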

      • Aceticon@lemmy.world
        English
        arrow-up
        5
        ·
        1 year ago

        The difference between having or not having something in the training set of a neural network is going to be different values for non-integer factors all over the neural network and, worse, it is just as likely that they’re tiny differences as it is that they’re massive differences.

        Or to give you a decent metaphor for it, “it would be like trying to remove a specific egg from a bowl of scrambled eggs”.
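
A small sketch of the "all over the network" effect, assuming a made-up two-weight linear model and hypothetical data. The removed example only has a non-zero value for the first feature, yet the second weight ends up different too, because later updates depend on what every weight was at the time.

```python
# Hypothetical two-weight model: prediction = w[0]*x[0] + w[1]*x[1].

def sgd_step(w, x, y, lr=0.1):
    # One gradient step on the squared error for a single example
    pred = sum(wi * xi for wi, xi in zip(w, x))
    err = pred - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

def train(examples):
    w = [0.0, 0.0]
    for x, y in examples:
        w = sgd_step(w, x, y)
    return w

examples = [
    ((1.0, 0.0), 1.0),  # the example to be "forgotten": only feature 0 is set
    ((0.0, 1.0), 2.0),
    ((1.0, 1.0), 2.5),
]

w_all = train(examples)          # trained on everything
w_without = train(examples[1:])  # trained without the first example

# Both weights differ, including the one the removed example never touched directly.
diffs = [a - b for a, b in zip(w_all, w_without)]
print(w_all, w_without, diffs)
```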

        • hglman@lemmy.ml
          English
          arrow-up
          1
          ·
          1 year ago

          The issue is the ownership of the AI; if it were not ownable or instead owned by everyone, there wouldn’t be an issue.

          • trashgirlfriend@lemmy.world
            English
            arrow-up
            1
            ·
            edit-2
            1 year ago

            Ah yes, let’s just quickly switch the mode of production in this industry, I’m sure that’s going to happen.

            I also don’t want my data to be processed by the fully automated luxury gay space machine learning algorithms.

  • Treczoks@lemmy.world
    English
    arrow-up
    21
    arrow-down
    1
    ·
    1 year ago

    Delete the AI and restart the training from the original sources minus the information it should not have learned in the first place.

    And if they claim “this is more complicated than that” you know their process is f-ed up.

    • gressen@lemm.ee
      English
      arrow-up
      7
      ·
      1 year ago

      You’re right, this is a way to solve this issue. It’s just not economically feasible to retrain your model from scratch every time. It takes a lot of money to do it and they will push back.

  • Dran@lemmy.world
    English
    arrow-up
    14
    ·
    1 year ago

    Or you know, if it’s impossible to strip out individual data, and it’s too expensive to retain/retrain models with data removed… Why is everyone overlooking “just don’t process private data, and only use public data in model training”?

    • Dojan@lemmy.world
      English
      arrow-up
      5
      ·
      1 year ago

      Yeah. Penalise it heavily, so that if you need to make a model, manually vetting the data is the most affordable option.

      Ultimately, ensuring models are trained on safe, good, legal data, and not just random bullshit scraped off of the internet, will just be a net positive overall.

      • assassin_aragorn@lemmy.worldOP
        English
        arrow-up
        3
        ·
        1 year ago

        Along those lines, perhaps you put in a stipulation that you don’t have to toss the model if you instead give the person a significant sum in royalties. After all, if their data isn’t a lynchpin in the model, you didn’t need it in the first place, and if it is crucial, you should pay them accordingly.

        Punitive regulations seem to be the best way to make companies grow a sense of ethics.

  • efrique@lemm.ee
    English
    arrow-up
    14
    ·
    edit-2
    1 year ago

    Then delete and start over, or don’t use data you don’t have explicit permission to use in the first place.

    It’s like a thief saying “well, I already fenced most of the stuff so it’s too hard to give any of it back. So let’s just call it quits, eh?”

  • Fades@lemmy.world
    English
    arrow-up
    1
    arrow-down
    4
    ·
    1 year ago

    Everyone in the thread is so triggered lol, do you hear yourselves?