There was a post on Reddit saying that when you don't have enough VRAM, instead of leaving some of the layers to the CPU, you can put all of the layers on the GPU and move only the FFN inside the layers to the CPU, and it gets faster; the poster asked why this isn't the standard approach. I tried it on my RTX 4060 Ti 16GB, and it did get slightly faster.
https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_offload_gguf_layers_offload_tensors_200_gen/

Trying it with Qwen3 30B A3B
You can't specify this in Ollama or LM Studio, so this time I'm trying it with KoboldCPP.
https://github.com/LostRuins/koboldcpp
KoboldCPP is a bit rough for day-to-day use, though, so trying llama.cpp might be better.
Anyway, running the q3_k_xl of Qwen3 30B A3B in LM Studio gives 15.58 tok/sec.

38 of the 48 layers are assigned to the GPU.

So: download the KoboldCPP executable and launch it with --overridetensors, --model, and --gpulayers specified:
koboldcpp.exe --overridetensors "blk\.([0-9]*[05])\.ffn_.*_exps\.=CPU" --model "D:\dev\gguf\unsloth\Qwen3-30B-A3B-GGUF\Qwen3-30B-A3B-UD-Q3_K_XL.gguf" --gpulayers 48
The key is the --overridetensors "blk\.([0-9]*[05])\.ffn_.*_exps\.=CPU" part. It puts the FFN (expert) tensors of the layers whose block numbers end in 0 or 5 onto the CPU.

Here I just used the pattern from the Reddit post, but if you want to check the tensor names, specifying .* as the pattern should put everything on the CPU, so you can confirm the names that way.
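You can also check which tensors a pattern catches with plain Java regex. This is just a sketch; the tensor names below are made-up examples in llama.cpp's blk.N.ffn_*_exps naming style, not a dump from a real model:

import java.util.List;
import java.util.regex.Pattern;

public class TensorMatch {
    public static void main(String[] args) {
        // the same pattern passed to --overridetensors
        Pattern p = Pattern.compile("blk\\.([0-9]*[05])\\.ffn_.*_exps\\.");
        // made-up tensor names in llama.cpp's naming style
        List<String> names = List.of(
                "blk.5.ffn_up_exps.weight",    // block 5: ends in 5  -> CPU
                "blk.10.ffn_gate_exps.weight", // block 10: ends in 0 -> CPU
                "blk.12.ffn_down_exps.weight", // block 12            -> stays on GPU
                "blk.15.attn_q.weight");       // attention tensor    -> stays on GPU
        for (String name : names) {
            // find(), because the pattern matches a prefix of the tensor name
            System.out.println(name + " -> " + (p.matcher(name).find() ? "CPU" : "GPU"));
        }
    }
}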
Access http://localhost:5001 and ask it something like "what's the difference between BERT and GPT?".

17.55 tok/sec! That's 12% faster.

Memory consumption is about like this.

Since 2.2GB was in use with the model unloaded, about 11.4GB is being consumed. That's the same as when LM Studio loads 36 layers.
With Llama 4 Scout Q2_K, comparing LM Studio offloading 16 layers to the GPU against KoboldCPP with --overridetensors "blk\.([0-9]*[0124578])\.ffn_.*_exps\.=CPU", which leaves about 2/3 of the FFNs on the CPU (7 of every 10 block numbers match), the speed went from 4.1 tok/sec to 4.9 tok/sec, about 20% faster.


That said, the effect is smaller than I expected, probably because my CPU is on the weak side. With a stronger CPU I think you'd see a bigger gain.
When I tried Qwen3 32B I couldn't get a performance improvement, but with a stronger CPU it would probably have a decent effect.
What is it actually doing?
To see what's going on, let's go over the basic structure of an LLM.
Today's LLMs are based on a structure called the transformer, which looks roughly like this. One layer spans from the positional encoding to the feed-forward network, and there are 48 such layers in Qwen3 30B A3B and 64 in Qwen3 32B.
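In code, the forward pass is just a loop over those layers. Here's a minimal schematic of that flow; the helper methods are hypothetical stubs, not the actual signatures from the Java port shown later:

public class LayerFlow {
    // hypothetical stubs standing in for the real per-layer processing
    static float[] attention(float[] x, int layer, int pos) { return x; }
    static float[] feedForward(float[] x, int layer)        { return x; }
    static float[] output(float[] x)                        { return x; }

    static float[] forward(float[] x, int pos, int nLayers) {
        for (int l = 0; l < nLayers; l++) { // 48 layers for Qwen3 30B A3B, 64 for Qwen3 32B
            x = attention(x, l, pos);       // rmsnorm + RoPE + multi-head attention + residual
            x = feedForward(x, l);          // rmsnorm + SwiGLU FFN + residual
        }
        return output(x);                   // final rmsnorm + projection to logits
    }

    public static void main(String[] args) {
        float[] token = new float[4096];    // embedding size assumed just for the demo
        System.out.println(forward(token, 0, 48).length);
    }
}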

Now, LM Studio and other LLM runtimes are set up so that you configure how many layers to put on the GPU, or conversely how many to leave on the CPU, in units of whole layers.
But whether a computation can be done efficiently on the CPU or really needs the GPU is determined by each component's role inside the layer, not by the layer as a whole. So the idea is: leave the parts the CPU handles efficiently on the CPU, and put the parts that really benefit from the GPU on the GPU as much as possible.
Why does that work?
To see why, I think the best way is to look at actual code.
So I'll base this on my rewrite of llama2.c in Java.
https://gist.github.com/kishida/05656bfcbe840f269784f7dbbee5928e
The LLM processing to look at is the forward method.
https://gist.github.com/kishida/05656bfcbe840f269784f7dbbee5928e#file-llama-java-L300
First, let's look at the feed-forward part in the latter half of a layer. This is the part we want to put on the CPU this time.
rmsnorm(s.xb, x, w.rms_ffn_weight[l], dim);

// Now for FFN in PyTorch we have: self.w2(F.silu(self.w1(x)) * self.w3(x))
// first calculate self.w1(x) and self.w3(x)
matmul(s.hb, s.xb, w.w1[l], dim, hidden_dim);
matmul(s.hb2, s.xb, w.w3[l], dim, hidden_dim);

// SwiGLU non-linearity
for (int i = 0; i < hidden_dim; i++) {
    // omitted
}

// final matmul to get the output of the ffn
matmul(s.xb, s.hb, w.w2[l], hidden_dim, dim);
The SwiGLU part is omitted here; it's a single loop. rmsnorm is a single loop too. Single loops basically don't take much time, so there's little need to accelerate them. What matters is matmul: the FFN's processing time is spent in the matmul parts.
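To put rough numbers on why matmul dominates, here's a back-of-envelope op count. The sizes are assumptions in the llama2-7B range (dim = 4096, hidden_dim = 11008), not Qwen3's actual dimensions:

public class FfnOpCount {
    public static void main(String[] args) {
        long dim = 4096, hiddenDim = 11008;   // assumed llama2-7B-style sizes
        long rmsnormOps = dim;                // one pass over dim elements
        long matmulOps = 2 * dim * hiddenDim  // w1 and w3: dim x hidden_dim each
                       + hiddenDim * dim;     // w2: hidden_dim x dim
        System.out.println("rmsnorm: ~" + rmsnormOps + " mul-adds");
        System.out.println("matmuls: ~" + matmulOps + " mul-adds");
        System.out.println("ratio:   ~" + matmulOps / rmsnormOps); // roughly 33,000x
    }
}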
The matmul looks like this:
static void matmul(float[] xout, float[] x, FloatBuffer ws, int n, int d) {
    MemorySegment w = MemorySegment.ofBuffer(ws);
    IntStream.range(0, d).parallel().forEach(i -> {
        FloatVector val = FloatVector.zero(SPECIES);
        for (int j = 0; j < n; j += SIMD_SIZE) {
            FloatVector a = FloatVector.fromMemorySegment(
                    SPECIES, w, (i * n + j) * FLOAT_SIZE, ByteOrder.LITTLE_ENDIAN);
            FloatVector b = FloatVector.fromArray(SPECIES, x, j);
            val = a.fma(b, val);
        }
        xout[i] = val.reduceLanes(VectorOperators.ADD);
    });
}
Leaving the details aside, just look at two things: IntStream's parallel() makes it multi-threaded, and inside there's a loop that uses FloatVector so that SIMD instructions such as AVX get used.
In other words, if there are enough cores to run the threads, and one instruction can process multiple data elements as with AVX, the CPU can be fast at this too.
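For reference, SPECIES and SIMD_SIZE in these excerpts come from Java's Vector API. A minimal setup would look like this; this is my guess at the definitions, using the preferred species, which selects the widest SIMD width the CPU supports:

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class SimdInfo {
    // 8 floats per vector with AVX2, 16 with AVX-512
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
    static final int SIMD_SIZE = SPECIES.length();

    public static void main(String[] args) {
        // run with: java --add-modules jdk.incubator.vector SimdInfo.java
        System.out.println(SPECIES + ": " + SIMD_SIZE + " floats per instruction");
    }
}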
Multi-head attention, on the other hand, looks like this:
// multihead attention. iterate over all heads
final var fl = l;
IntStream.range(0, p.n_heads).parallel().forEach(h -> {
    int qpos = h * head_size;
    int kvpos = h / kv_mul * head_size;
    float[] att = s.att[h];
    for (int t = 0; t <= pos; t++) {
        float score = 0;
        FloatVector val = FloatVector.zero(SPECIES);
        for (int i = 0; i < head_size; i += SIMD_SIZE) {
            FloatVector a = FloatVector.fromArray(SPECIES, s.q, qpos + i);
            FloatVector b = FloatVector.fromArray(SPECIES, s.key_cache[fl][t], kvpos + i);
            val = a.fma(b, val);
        }
        score = val.reduceLanes(VectorOperators.ADD);
        score /= head_aqrt;
        // save the score to the attention buffer
        att[t] = score;
    }
    ...
Multi-threading with IntStream's parallel() and a FloatVector loop inside is similar to matmul, but here the FloatVector loop is wrapped by another loop, making three nested loops in total.
And the middle loop gets no hardware acceleration in particular; a CPU has no mechanism to speed that part up.
There is Intel AMX, which finally got included in 4th-generation Xeon, but it hasn't spread widely yet. Apparently LLMs do get faster when it can be used.
About test results for Intel's AI-ready AMX CPU accelerator | Google Cloud Official Blog
With a GPU, on the other hand, the triple loop can be sped up.
The description of OpenCL, a common framework for GPU processing, puts it like this:
Every problem to be solved has some degree of dimensionality, such as linear, cube-shaped, or planar. In OpenCL, a kernel is launched over an index space of up to three specified dimensions.
I meant to casually show some source code here that handles a triple loop efficiently on the GPU, but it turned out that in my old code for speeding up deep learning with a GPU, I hadn't written the triple-loop processing properly, so it hadn't actually been sped up.
https://github.com/kishida/neuralnet/blob/use_jocl/src/main/resources/kernels/convolution_forward.cl#L15
It should get faster if the i and j loops are also handed over to the GPU's parallelism, like this:
int fxy = get_global_id(0);
int i = get_global_id(1);
int j = get_global_id(2);
[増補改訂]GPUを支える技術 ――超並列ハードウェアの快進撃[技術基礎] (WEB+DB PRESS plus)