How much CPU hardware is needed to run this model with multiple requests?
#4 opened by ramda1234786
I want to run inference on this model with multiple concurrent requests using nginx and llama.cpp.
I have no GPUs, but I do have CPUs.
So far I have tried 64 GB of RAM and 16 CPU cores. It runs slowly: with max tokens set to 64 it gives some response, but the moment I increase max tokens to 1024 it fails.
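For context, here is roughly the kind of CPU-only setup I mean (a minimal sketch using the llama-cpp-python bindings just for illustration; my real server runs llama.cpp behind nginx, and the model path, context size, and thread count below are placeholders rather than my exact values):

```python
from llama_cpp import Llama

# CPU-only load; n_gpu_layers=0 keeps everything off the GPU.
# Model path, context size, and thread count are placeholders.
llm = Llama(
    model_path="model-Q4_K_M.gguf",
    n_ctx=2048,       # context window must cover prompt + generated tokens
    n_threads=16,     # roughly the number of physical CPU cores
    n_batch=512,      # prompt-processing batch size
    n_gpu_layers=0,   # force CPU-only inference
)

output = llm(
    "Summarise the benefits of quantized models in one paragraph.",
    max_tokens=1024,  # the generation length I would like to reach
)
print(output["choices"][0]["text"])
```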
I can add more CPUs and RAM, but not GPUs. Can someone please suggest the correct approach here?
Since Unsloth's claim here is a 60% reduction in compute, we should be able to run this on CPU. I can add 32 to 64 CPU cores if needed.