• 12 Posts
  • 272 Comments
Joined 2 years ago
Cake day: June 16th, 2023


  • I honestly don’t know how to read the situation. Ukraine has fought terrifically, but its position seems far less sustainable even if you discount the Trump stuff. I don’t put much stock in the claims that Russia is on the verge of imploding from the stress of the war any day now. It’s possible, but it mostly seems like wishful thinking.

    External aid changes the situation a bit, but not ultimately that much because no Western power seems willing to directly intervene with troops. Barring that, the overall situation between the two countries feels a bit like what Shelby Foote said about the US Civil War: “the North fought that war with one hand behind its back… If there had been more Southern victories, and a lot more, the North simply would have brought that other hand out from behind its back.”




  • “European leaders” operate under a permanent disadvantage because they have to agree among themselves before they can do anything. This leaves them unable to take the initiative geopolitically, and prone to taking whatever path of least resistance lies before them. The US and Russia have concluded that Europe will roll over and accept whatever it is presented with, after some angsty wailing, and unfortunately they are probably right. Not inviting Europe to the talks is just a dominance move showing that they know the Europeans can’t do anything about it.

    Unfortunately for Europe, this is just the logical end point of their institutional arrangements. In a domain like geopolitics, where there are intelligent players looking for advantage, it is suicidal to turn off your ability to make decisions.


  • Dylan’s just being deliberately obtuse. Deepseek developed a way to increase training efficiency and backed it up by quoting the training cost in terms of the market price of the GPU time. They didn’t include the cost of the rest of their datacenter, researcher salaries, etc., because why would you include those numbers when evaluating model training efficiency???

    The training efficiency improvement passes the sniff test based on the theory in their paper, and people have done back-of-the-envelope calculations that also agree with the outcome. There’s little reason to doubt it. In fact, people have made the opposite criticism: that none of Deepseek’s optimizations is individually groundbreaking, and that all they did was “mere engineering,” putting a dozen or so known optimization ideas together.
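    As a rough illustration of that back-of-the-envelope check (the figures below are the commonly cited ones from the V3 technical report, quoted from memory, so treat them as approximate):

    ```python
    # Back-of-the-envelope check of the quoted training cost: GPU-hours times market rental price.
    # Figures are approximately those cited in the Deepseek-V3 technical report; treat as assumptions.
    gpu_hours = 2_788_000        # total H800 GPU-hours reported for the full V3 training run
    price_per_gpu_hour = 2.00    # assumed rental price per H800 GPU-hour, in USD

    total_cost = gpu_hours * price_per_gpu_hour
    print(f"Estimated GPU-time cost: ${total_cost / 1e6:.2f}M")  # ~$5.58M
    ```

    That figure covers GPU time only, which is exactly the point: it’s a statement about training efficiency, not the total cost of running the lab.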




  • Aww come on. There’s plenty to be mad at Zuckerberg about, but releasing Llama under a semi-permissive license was a massive gift to the world. It gave independent researchers access to a working LLM for the first time. For example, Deepseek got their start messing around with Llama derivatives back in the day (though, to be clear, their MIT-licensed V3 and R1 models are not Llama derivatives).

    As for open training data, it’s a good ideal, but I don’t think it’s a realistic possibility for any organization that wants to build a workable LLM. These models are trained on trillions of tokens, and no matter how hard you try to clean the data, there’s definitely going to be something lawyers can find to sue you over. No organization is going to open itself up to that liability. And if you gimp your data set, you get a dumb AI that nobody wants to use.


  • It’s definitely a trend. More and more top Chinese students are also opting to stay in China for university, rather than going to the US or Europe to study. It’s in part due to a good thing, i.e. the improving quality of China’s universities and top companies. But I think it’s a troubling development for China overall. One of China’s strengths over the past few decades has been their people’s eagerness to engage with the outside world, and turning inward will not be beneficial for them in the long run.




  • Base models are general purpose language models, mainly useful for AI researchers and people who want to build on top of them.

    Instruct or chat models are chatbots. They are made by fine-tuning base models.

    The V3 models linked by OP are Deepseek’s non-reasoning models, similar to Claude or GPT-4o. These are the “normal” chatbots that reply with whatever first comes to mind. Deepseek also has a reasoning model, R1. Such models take time to “think” before supplying their final answer; they tend to perform better on things like math problems, at the cost of taking longer to answer.

    It should be mentioned that you probably won’t be able to run these models yourself unless you have a data center style rig with 4-5 GPUs. The Deepseek V3 and R1 models are chonky beasts. There are smaller “distilled” forms of R1 that are possible to run locally, though.
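    As a rough sketch of what running one of those distills looks like (the model ID and hardware numbers here are my assumptions, so double-check them against Deepseek’s Hugging Face page):

    ```python
    # Sketch: running a small R1 distill locally with Hugging Face transformers.
    # The model ID and memory estimate are illustrative assumptions, not confirmed specs.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # ~7B parameters; roughly fits a single 24 GB GPU at fp16
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

    prompt = "How many prime numbers are there between 10 and 30?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)  # reasoning models emit their "thinking" before the answer
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    ```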




  • Intriguingly, there’s reason to believe the R1 distills are nowhere close to their peak performance. In the R1 paper, Deepseek say the models are released as proofs of concept of the power of distillation, and that performance could probably be improved further with an additional reinforcement learning step (like the one used to turn V3 into R1). But they say they basically couldn’t be bothered to do it themselves and are leaving it for the community to try.

    2025 is going to be very interesting in this space.



  • No AI org of any significant size will ever disclose its full training set, and it’s foolish to expect that standard to be met. There is just too much liability. No matter how clean your data collection procedure is, there’s no way to guarantee that a data set of billions of samples won’t contain at least one thing a lawyer could zero in on and drag you into a lawsuit over.

    What Deepseek did do, namely full disclosure of methods in a scientific paper, release of the weights under the MIT license, and release of some auxiliary code, is about as much as one can reasonably expect.