• Lvxferre@mander.xyz
    link
    fedilink
    English
    arrow-up
    6
    arrow-down
    3
    ·
    6 months ago

    It doesn’t need to be filtered into human / AI content. It needs to be filtered into good (true) / bad (false) content. Or a “truth score” for each.

    That isn’t enough because the model isn’t able to reason.

    I’ll give you an example. Suppose that you feed the model with both sentences:

    1. Cats have fur.
    2. Birds have feathers.

    Both sentences are true. And based on vocabulary of both, the model can output the following sentences:

    1. Cats have feathers.
    2. Birds have fur.

    Both are false but the model doesn’t “know” it. All that it knows is that “have” is allowed to go after both “cats” and “birds”, and that both “feathers” and “fur” are allowed to go after “have”.

    • KevonLooney@lemm.ee
      link
      fedilink
      English
      arrow-up
      6
      arrow-down
      4
      ·
      6 months ago

      It’s not just a predictive text program. That’s been around for decades. That’s a common misconception.

      As I understand it, it uses statistics from the whole text to create new text. It would be very rare to output “cats have feathers” because that phrase doesn’t ever appear in the training data. Both words “have feathers” never follow “cats”.

      • skulblaka@sh.itjust.works
        link
        fedilink
        English
        arrow-up
        7
        arrow-down
        1
        ·
        6 months ago

        But the fact remains that it doesn’t know what a cat or a feather is. All of this is still based purely on statistical frequency and not at all on actual meanings.

      • barsoap@lemm.ee
        link
        fedilink
        English
        arrow-up
        3
        arrow-down
        1
        ·
        edit-2
        6 months ago

        because that phrase doesn’t ever appear in the training data.

        Eh but LLMs abstract. It has seen “<animal> have feathers” and “<animal> have fur” quite a lot of times. The problem isn’t that LLMs can’t reason at all, the problem is that they do employ techniques used in proper reasoning, in particular tracking context throughout the text (cross-attention) but lack techniques necessary for the whole thing, instead relying on confabulation to sound convincing regardless of the BS they spout. Suffices to emulate an Etonian but that’s not a high standard.

      • WalnutLum@lemmy.ml
        link
        fedilink
        English
        arrow-up
        1
        ·
        edit-2
        6 months ago

        This isn’t really accurate either. At the moment of generation, an LLM only has context for the input string and the network of text tokens it’s been assigned. It pulls from a “pool” of these tokens based on what it’s already output and the input context, nothing more.

        Most LLMs have what are called “Top P”, “Top K” etc, these are the number of tokens that it ends up selecting from based on the previous token, alongside the input tokens. It then randomly chooses one based on temperature settings.

        It’s why if you turn these models’ temperature settings really high they output pure nonsense both conceptually and grammatically, because the tenuous thread linking the previous token’s context to the next token has been widened enough that it completely loses any semblance of cohesiveness.

    • CeeBee_Eh@lemmy.world
      link
      fedilink
      English
      arrow-up
      1
      ·
      6 months ago

      Both sentences are true. And based on vocabulary of both, the model can output the following sentences:

      1. Cats have feathers.
      2. Birds have fur

      This is not how the models are trained or work.

      Both are false but the model doesn’t “know” it. All that it knows is that “have” is allowed to go after both “cats” and “birds”, and that both “feathers” and “fur” are allowed to go after “have”.

      Demonstrably false. This isn’t how LLMs are trained or built.

      Just considering the contextual relationships between word embeddings that are created during training is evidence enough. Those relationships from the multi-vector fields are an emergent property that doesn’t exist in the dataset.

      If you want a better understanding of what I just said, take a look at this Computerphile video from four years ago. And this came out before the LLM hype and before ChatGPT 3, which was the big leap in LLMs.