Thanks in advance.
For LLMs it entirely depends on what size models you want to use and how fast you want them to run. Since there are diminishing returns to increasing model size, i.e. a 14B model isn’t twice as good as a 7B model, the best bang for the buck will be achieved with the smallest model you think has acceptable quality. And if you think generation speeds of around 1 token/second are acceptable, you’ll probably get more value for money using partial offloading, where only some of the model’s layers sit on the GPU and the rest run from system RAM on the CPU.
If your answer is “I don’t know what models I want to run” then a second-hand RTX 3090 is probably your best bet. If you want to run larger models, building a rig with multiple (used) RTX 3090s is probably still the cheapest way to do it.
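To put rough numbers on "what size model fits", here is a back-of-the-envelope sketch in Python. It only counts the memory for the weights (parameters times bits per weight) plus a flat overhead for context and runtime buffers. The 2 GB overhead is my own guess, not a measured value, and the function names are made up for illustration. The 24 GB figure is the VRAM of a single RTX 3090.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an ASSUMED flat overhead fit in vram_gb."""
    return estimate_weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


# Compare a few common model sizes against one 24 GB card.
for size in (7, 14, 34, 70):
    for bits in (16, 8, 4):
        need = estimate_weights_gb(size, bits)
        verdict = "fits" if fits(size, bits, vram_gb=24) else "needs offloading or more GPUs"
        print(f"{size}B @ {bits}-bit: ~{need:.1f} GB weights, {verdict}")
```

Real usage will be higher with long contexts, so treat this as a lower bound rather than a guarantee.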
I feel like this really depends on what hardware you have access to. What are you interested in doing? How long are you willing to wait for it to generate, and how good do you want it to be?
You can get around 0.5 words per second out of one of the Mistral models on the CPU with 32GB of RAM. The Stable Diffusion image models work okay with 8-16GB of VRAM.
Automatic1111 for Stable Diffusion and Ollama for LLMs
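If you go the Ollama route, you can also script it over its local HTTP API instead of using the CLI. A minimal sketch, assuming Ollama is running on its default port (11434) and that you have already pulled a model; `mistral` here is just an example name, and `build_request` / `generate` are helper names I made up:

```python
import json
import urllib.request

# Default local endpoint for Ollama's generate API.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "mistral") -> urllib.request.Request:
    """Build a non-streaming generate request (model name is an assumption)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "mistral") -> str:
    """Send the prompt to a locally running Ollama and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Say hello in one sentence.")` should return the model's reply as a string, provided the server is up and the model has been pulled.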
KoboldCpp or LocalAI will probably be the easiest out-of-the-box option that covers both image generation and LLMs.
I personally use vLLM and HuggingChat, mostly because of vLLM’s efficiency and speed.