Need more RAM for Llama 3 70B inference; can't afford A100s, they cost more than my fking car. Anyway, ordered an M2 Ultra from Apple for 8 grand to get the 192 GB of unified RAM.
I'm sure it's suboptimal compared to Nvidia builds (e.g. 2x4090) in terms of performance, no CUDA training and whatnot, but the 2-3k markup for a custom build makes no sense to me. On the other hand, renting a GPU server with enough RAM for Llama 3 runs around $5/hour, and I'm fed up with BS AWS bills of hundreds of dollars a month for instances that aren't even running.
Thinking about canceling the order and looking further. What would you get with 8 grand for a local LLM?
iToddlers BTFO
what are you doing with a local llm that requires this?
Code gen
What is your business model?
More AI, faster prototyping, less code to write/maintain
>*slowly writing 'Bankruptcy' in comically large letters on a clipboard*
mhm yes i see
itodeleres butfo
why are you running llama 3 local?
Lower error rates
Tasks with contexts of proprietary stuff
I have one. Inference speed sucks but MoE models make up for that while still benefiting from the large amount of memory. It’s nice that it can sit on a shelf and be quiet and cool and sip power. Whether or not that’s still worth $8k is up to you.
Thanks for sharing your thoughts, yeah, the RAM seems generous enough for quite a few good models at f16.
Speed-wise, some report it gets 2 t/s on larger models, which crunches through roughly 100k t/day worth of task queue (~170k at full utilization).
Not sure if yours saw better speed than that
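For what it's worth, a quick sanity check on that queue-throughput figure, assuming the reported 2 t/s held nonstop:

```python
# Daily token throughput at a sustained generation rate.
tok_per_s = 2.0                      # reported M2 Ultra speed on large models
seconds_per_day = 24 * 60 * 60
daily = tok_per_s * seconds_per_day
print(f"{daily:,.0f} tokens/day")    # 172,800 at 100% duty cycle
```

So ~100k/day implies roughly 60% utilization, which is plausible once prompt processing and idle time are counted.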
Write it up as a company/department expense.
Wouldn't a modern server CPU be faster and cheaper? The main bottleneck is RAM speed when doing it on the CPU; I'm pretty sure you can buy a very high-bandwidth RAM setup for 8k
unlikely, because gpu offloading
damn, a 30x P40 grid would make quite the sight to behold, which presumably purrs like a Ford GT
Twitter guy is the embodiment of cyberpunk
>unlikely, because gpu offloading
on the m2 ultra? I doubt that. I want to see a benchmark of someone comparing a Threadripper with the M2 Ultra GPU for matrix multiplication. Modern processors should get quite close when competing against something integrated.
>Modern processors should get quite close if competing against something integrated.
bullshit.
a GPU will have like a magnitude more ALUs than a CPU.
https://github.com/ggerganov/llama.cpp/discussions/4167
>m2 ultra prompt processing llama 7b f16
>1401.85 tokens/second
>5800X3D prompt processing llama 7b f16
>30.69 tokens/second
m2 ultra is 46x as fast as a 5800X3D for matmul.
this doesn't make much sense to me, why the frick would modern CPUs be that much slower? I was expecting maybe 10x, not 46x
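The gap is less mysterious once you split the two workloads: prompt processing is one big batched matmul, so it's compute-bound and the GPU's order-of-magnitude ALU advantage shows up directly; single-token generation instead streams every weight once per token, so memory bandwidth caps it. A back-of-envelope sketch (the bandwidth and model-size numbers below are rough assumptions, not benchmarks):

```python
def max_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    # Generating one token reads (roughly) every weight once,
    # so memory bandwidth is a hard ceiling on generation speed.
    return bandwidth_gb_s / model_gb

# Rough figures: M2 Ultra ~800 GB/s unified memory; dual-channel
# DDR4 ~50 GB/s; Llama 3 70B at f16 is ~140 GB of weights.
print(max_tok_per_s(800, 140))  # ~5.7 t/s ceiling
print(max_tok_per_s(50, 140))   # ~0.36 t/s ceiling
```

That ceiling is why a many-channel server CPU (several hundred GB/s) can be competitive for generation, while prompt processing still favors whatever has the most ALUs.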
>he spent 8k on iShit for an LLM build
>only 192
>could have bought a server rack with 30 P40s and 720GB of VRAM
>30 used p40s
useless. will fail < 1 year
I think OP is just low key trolling
It's not all bad, you could spend another $10K, get your dual-4090 system and another Mac Studio. I saw that multiple Mac Studios connected via Thunderbolt ethernet tunneling could theoretically provide a way larger VRAM pool... but the software isn't there yet. lol.
https://x.com/ronaldmannak/status/1784087769817756138
Sometimes less is more. Why exactly can't you just get:
>a desktop PC with a single 4090
>a macbook
>a thinkpad
If you are on the move, just ssh into your desktop, or however that works with LLMs; I'm sure you can mount it or whatever technique people use
https://twitter.com/kellerjordan0/status/1765649009891328253
Already have the rest.
The Twitter post sounds promising, except as of now 100GB of weights still takes 100GB of RAM to load, at least until they publish the million-dollar software to fit it into a 24GB 4090
Interesting guide, thanks. Might spend 1-2k playing around with retired server parts in the future, but can't count on them for production
Did you consult /llm/ local ml build autist?
https://rentry.org/V100MAXXING#pcie
He has multiple build guides that perform better but for like $1k to $2k
It's a lot of running around eBay and obscure vendors a lot of the time
https://rentry.org/miqumaxx
This is his CPUMAXX'ing guide to essentially build what you'd buy from Apple, for cheaper
i mean, an amd/inhell laptop with that much ram would cost you 2-3k... so, yeah, you probably did frick up.
the difference is that on Mac it's VRAM too; there is nothing with this much VRAM in the consumer space.
Macs are ultra shit for LLMs because of dogshit prompt processing speed. Inference speed might as well be instant, but you're still gonna wait a few minutes for your 10k-context prompt to process first. Also, in case you decide you want something more than an LLM: Macs can't do anything else. Image gen, audio gen, text-to-speech, nothing works with Apple GPUs; it's all CUDA or bust.
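To put numbers on that wait: time-to-first-token is just prompt length divided by prompt-processing speed. The rates below are hypothetical round figures for a large model, not benchmarks:

```python
def time_to_first_token(prompt_tokens: int, pp_tok_per_s: float) -> float:
    # The whole context must be processed before any output appears.
    return prompt_tokens / pp_tok_per_s

print(time_to_first_token(10_000, 60))    # ~167 s at 60 t/s prompt processing
print(time_to_first_token(10_000, 1500))  # ~7 s at 1500 t/s
```

So even with fast generation, a slow prompt-processing rate dominates the experience on long contexts.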
You are a fricking moron.
why is it so fricking hard to build your own 2xGPU rig? Are you gay? Did your dad not buy you legos? Did you even have a dad?
https://pcpartpicker.com/list/zFHxMV
Thanks, not OP, but I've been looking for a consumercuckmaxx build and this is it
what's the actual use case of the top end expensive macs? like the 10k+ ones?
always thought it was for people who make tons of content. doesn't make sense to me, but i don't do that work, and for some people money is no object
>the 2-3k markup for custom build making no sense to me
and the 6k markup for apple logo did?