This post collects my thoughts on GRPO (Group Relative Policy Optimization), the reinforcement learning method for LLM tuning that was also adopted in DeepSeek-R1.
- GRPO: the RL fine-tuning method of DeepSeek-R1
- Prerequisite methods: TRPO/PPO
- Differences between GRPO and PPO: (1) advantage calculation and (2) KL-distance constraint from the reference model
- Change (1): how the advantage A is computed
- Change (2): KL-distance constraint from the reference model (SFT model)
- Summary
- Next: GSPO (Group Sequence Policy Optimization)
GRPO: the RL fine-tuning method of DeepSeek-R1
DeepSeek-R1, which demonstrates performance rivaling the GPT-o1 model, has been making waves. Because DeepSeek-R1 is a commercially usable open-weight model with a disruptively cheap API, it is expected to have a major impact across a wide range of LLM use cases.
From a technical standpoint, one particularly interesting point is that DeepSeek-R1's RL tuning uses GRPO (Group Relative Policy Optimization), a reinforcement learning method proposed as a specialization of PPO for LLM tuning.*1


GRPO's biggest change from PPO is that it computes the advantage (A) directly from the episode reward (r), making a function approximation of the state value V(s) unnecessary. With this change, where conventional PPO had to simultaneously train two networks during RL tuning, the policy function (= the LLM) and the state-value function V(s), GRPO only needs to train the policy function (= the LLM). Having just one network to train not only reduces the required compute but likely also contributes substantially to training stability.
Furthermore, stable training makes it easier to train larger models on more data, so these changes presumably contribute indirectly to performance gains as well.
Prerequisite methods: TRPO/PPO
Let me briefly explain TRPO and PPO, the methods GRPO builds on, drawing on John Schulman's lecture slides.
TRPO: Trust Region Policy Optimization

The policy gradient theorem tells us the gradient direction of the policy parameters that maximizes reward, but it says nothing about an appropriate update step size, so training often becomes unstable. TRPO therefore avoids extreme parameter updates by imposing a KL-distance constraint between the post-update policy πθ and the pre-update policy πθ_old at each policy update.
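(The slide excerpt here is an image in the original post. For reference, my transcription of the TRPO update it describes: a surrogate objective is maximized under a KL trust-region constraint.)

\[
\max_\theta\; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]
\quad \text{s.t.}\quad
\mathbb{E}_t\!\left[D_{\mathrm{KL}}\!\bigl(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\bigr)\right] \le \delta
\]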

Because TRPO repeats the cycle of collecting samples → several gradient updates → collecting samples → several gradient updates, the policy πθ_old that collected the data does not necessarily match the current policy πθ. The objective therefore has to be corrected via importance sampling, as in the ratio appearing above.
[1502.05477] Trust Region Policy Optimization
PPO: Proximal Policy Optimization
TRPO rigorously solves a constrained optimization problem with Lagrange multipliers at every gradient update, which makes its computational cost so large that it could not be applied to large neural networks. PPO is the successor method, proposed as a simplified TRPO to address this problem.
PPO achieves TRPO's goal, preventing extreme parameter updates, through an implicit KL-distance constraint: whenever the importance ratio becomes too large (or too small), it is simply clipped.
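(Again transcribing the standard form shown as an image: PPO's clipped surrogate objective, with importance ratio r_t(θ).)

\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\bigl(r_t(\theta)\,\hat{A}_t,\;\mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\bigr)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\]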

[1707.06347] Proximal Policy Optimization Algorithms
Other reference materials:
Introduction - Hugging Face Deep RL Course
ハムスターでもわかるTRPO ①基本編 - どこから見てもメンダコ
ハムスターでもわかるProximal Policy Optimization (PPO)①基本編 - どこから見てもメンダコ
Differences between GRPO and PPO: (1) advantage calculation and (2) KL-distance constraint from the reference model
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Looking at the objective functions, we can see that there are only two main differences between GRPO and PPO.
PPO objective:
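(The original shows this as an image; the following is my transcription of the PPO objective as written in the DeepSeekMath paper, where q is a question and o a sampled answer.)

\[
\mathcal{J}_{\mathrm{PPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, o \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}\,
\frac{1}{|o|}\sum_{t=1}^{|o|}\min\!\bigl(\rho_t A_t,\;\mathrm{clip}(\rho_t,\,1-\varepsilon,\,1+\varepsilon)\,A_t\bigr),
\qquad \rho_t = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\mathrm{old}}}(o_t \mid q, o_{<t})}
\]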

GRPO objective:
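(Likewise my transcription from the paper; note the group of G sampled answers, the per-token clipped ratio, and the explicit KL penalty weighted by β.)

\[
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}\,
\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Bigl\{
\min\!\bigl(\rho_{i,t}\,\hat{A}_{i,t},\;\mathrm{clip}(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_{i,t}\bigr)
- \beta\, D_{\mathrm{KL}}\!\bigl[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr]
\Bigr\}
\]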

(1) How the advantage A is computed: conventional PPO computed the advantage using a function approximation of the state value V(s), trained separately from the policy function (= the LLM). GRPO computes the advantage from the reward r alone, so the function approximation of V(s) is no longer needed.
(2) Where the KL-distance constraint from the reference model (SFT model) lives: in conventional PPO, the KL-distance constraint from the reference model (SFT model) was folded into the reward r; in GRPO it is added explicitly to the objective function as a penalty term.
In other words, apart from these two points GRPO is exactly the same as PPO. Even so, change (1) to the advantage calculation strikes me as an elegant idea, simple yet both strongly convincing and highly practical. Change (2) is more of a nice bonus that reduces compute.
Change (1): how the advantage A is computed
REINFORCE: policy gradient without value function approximation
By the policy gradient theorem, the gradient of the expected cumulative reward J(θ) when action selection follows the policy function πθ can be estimated by Monte Carlo from the following equation.
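(The equation is an image in the original; the standard baselined form it refers to is:)

\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(Q(s_t, a_t) - b(s_t)\bigr)\right]
\]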
Here, Q(s_t, a_t) is the state-action value, a measure of how good it is to choose action a_t in state s_t. b(s_t) is an arbitrary function that does not depend on a, called the baseline function. The baseline does not affect the expected value of the gradient estimate (∇J), but choosing it well can reduce the variance of the estimate. Since any function is allowed, we could uniformly set b(s) = 0; but if we can design b(s) so that E[Q(s, a) − b(s)] = 0, the variance of the gradient estimate shrinks and convergence improves.
Q can be estimated in various ways. The simplest is to use the sum of rewards from time t onward, Σ_{t'≥t} r_{t'}, as the estimate of the state-action value Q; this method is generally called REINFORCE.
As the baseline b(s), the estimated mean of the return (the episode's total reward) is often adopted for its computational convenience. However (and this matters), except in the special class of environments where the reward occurs only at the final step of the episode, E[Q(s, a) − mean return] ≠ 0 at any time other than t = 0, so this is not an especially good baseline design.
Thus in REINFORCE, neither the computation of the state-action value Q(s, a) nor the baseline b(s) involves value function approximation; both are estimated directly from rewards, so the only network to train is the policy function.
PPO (Actor-Critic): policy gradient with value function approximation
One natural choice of baseline function is a function approximation of the state value V(s).
Since E[Q(s, a) − V(s)] = 0 under the optimal policy, V(s) makes a good baseline function.
Q(s, a) − V(s) is generally called the advantage function A. Intuitively, the advantage function expresses the relative value of action a in state s; relativizing values in this way emphasizes which action A should be chosen in a given state S.
A policy-gradient architecture based on an advantage function that uses a function approximation of V(s) is generally called Actor-Critic, and PPO is one member of the Actor-Critic family.
Actor-Critic must co-evolve two networks, the policy function and the value function, which brings training-instability issues; but because it often delivers even better performance, it has become the mainstream policy-gradient architecture across a wide range of reinforcement learning tasks.
Note that variations of the advantage function exist depending on how Q(s, a) is estimated. For example, the state-action value Q can be estimated from the state value V as Q(s, a) = r_t + V(s_{t+1}); this is called the one-step advantage (I believe).
PPO adopts GAE (Generalized Advantage Estimation), a more refined advantage computation, but it is not essentially different from the one-step advantage above.
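(For reference, the standard GAE definition; setting λ = 0 recovers exactly the one-step advantage above.)

\[
\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t = \sum_{l=0}^{\infty}(\gamma\lambda)^l\,\delta_{t+l},
\qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
\]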
GRPO: a scaled REINFORCE
Given the characteristic of LLM reward models that the reward arrives when the answer is completed, my personal reading is that GRPO proposes: wouldn't it be more efficient to drop the value function approximation and compute the advantage in a way close to REINFORCE?
Let's first return to the REINFORCE update rule.
When the reward occurs only at the final step (answer completion), as with LLM reward models, the reward r_t is 0 at every time other than T, so the sum of rewards from time t onward does not depend on the start time t and simply equals r_T. That is,
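\[
\sum_{t'=t}^{T} r_{t'} = r_T \qquad (\text{since } r_{t'} = 0 \text{ for all } t' < T)
\]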
Here, the immediate reward r_T at the final step is the output of the reward model for the completed answer (output) to the given question, so we denote it r_out|question and rewrite the REINFORCE update rule.
Next, consider the design of the baseline function b(s_t).
As noted above, if we can design b(s) so that E[Q(s, a) − b(s)] = 0, the variance of the gradient estimate drops and learning stabilizes. What b(s) should estimate, then, is the expectation of r_out|question, and the simplest way to get it is Monte Carlo estimation: sample many answers (outputs) to a single question and simply take their mean, which estimates the expectation of r_out|question.

Suppose we sample a group of G answers (o1, o2, ..., oG) for a question q. Using the reward r_i of each answer, REINFORCE can be rewritten as follows.
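(My reconstruction of the rewritten update shown as an image here, using the notation above:)

\[
\nabla_\theta J(\theta) \approx \frac{1}{G}\sum_{i=1}^{G}\nabla_\theta \log \pi_\theta(o_i \mid q)\,\bigl(r_i - b\bigr),
\qquad b = \mathrm{mean}(r_1, \ldots, r_G)
\]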
Naturally E[r − mean(r1, ..., rG)] = 0, so this is a good baseline design. Scaling further by the standard deviation of the rewards within the answer group yields GRPO's advantage function.
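(That is, the GRPO advantage for answer i is:)

\[
\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}
\]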
In this way, GRPO takes the nature of LLM reward models into account and adopts, as the baseline b(s), a Monte Carlo estimate of the expected reward rather than a function approximation of the state value V(s); viewed through the lens of baseline design, it is simple, convincing, and elegant. The key point is that sampling many answers to a single question, rather than answers to a random batch of questions, allows the expectation to be estimated with high accuracy; this is where the name "Group Relative Policy Optimization" comes from.
Because GRPO's advantage computation, unlike conventional PPO's, needs no function approximation of the state value V(s), it avoids the Actor-Critic architecture, whose training is unstable because two networks must co-evolve.
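To make the computation concrete, here is a minimal NumPy sketch of the group-relative advantage described above (function and variable names are my own, not from the paper):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages for one group of G answers sampled from the same question.

    rewards: shape (G,), the scalar reward-model score of each answer o_1 .. o_G.
    Returns A_i = (r_i - mean(r)) / std(r); each A_i is applied to every token
    of answer i in the policy-gradient update.
    """
    baseline = rewards.mean()    # Monte Carlo estimate of E[r_out|question]
    scale = rewards.std() + eps  # group standard deviation (eps avoids division by zero)
    return (rewards - baseline) / scale

# Example: G = 4 sampled answers to one question, scored by a reward model
rewards = np.array([0.1, 0.9, 0.4, 0.6])
print(group_relative_advantages(rewards))  # positive = better than the group average
```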
Change (2): KL-distance constraint from the reference model (SFT model)

徿¥ã¯åç §ã¢ãã«å¶ç´ã¯å ±é ¬ã«å«ãããã¦ãã
LLMã®å¼·åå¦ç¿ãã¥ã¼ãã³ã°ã§ã¯ã¢ãã«ãpretrainingæã®è¨æ¶ã失ããã¨ãé²ããããè¨ç·´åã¢ãã«ï¼pretrainedã¢ãã« / SFTã¢ãã«ï¼ã¨ã®KLè·é¢ãé¢ããããªãããã«å¶ç´ãä¸ãããã¨ãä¸è¬çã§ãããã®KLè·é¢å¶ç´ã¯å¾æ¥ï¼Instruct-GPTãªã©ï¼ã¯å ±é ¬ã®ä¸é¨ã¨ãã¦æé»çã«åãè¾¼ã¾ãã¦ãã¾ããã
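(Schematically, in the InstructGPT-style setup the reward that PPO maximizes is the reward-model score with the KL penalty folded in, roughly:)

\[
r_{\mathrm{total}}(q, o) = r_\varphi(q, o) - \beta \log \frac{\pi_\theta(o \mid q)}{\pi_{\mathrm{ref}}(o \mid q)}
\]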

GRPO, in contrast, builds the constraint term D_KL explicitly into the objective function.

One likely reason for this change is to simplify the computation graph of the advantage function and reduce compute. Also, as discussed above, GRPO's baseline assumes the LLM-reward-model-specific property that the reward arrives when the answer completes, so I suspect the authors wanted to avoid the intermediate rewards that folding the KL penalty into the reward would create.
Monte Carlo estimation of the KL distance
The KL distance is computed sampling-based via Monte Carlo estimation, but the formula in the paper looks unfamiliar at first glance.

As background, the KL distance between πθ and πref can, most simply, be estimated sampling-based according to the following formula.
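(Transcribing the simple estimator: sample from πθ and average the log ratio.)

\[
D_{\mathrm{KL}}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr)
= \mathbb{E}_{x \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(x)}{\pi_{\mathrm{ref}}(x)}\right]
\approx \frac{1}{N}\sum_{n=1}^{N} \log \frac{\pi_\theta(x_n)}{\pi_{\mathrm{ref}}(x_n)},
\qquad x_n \sim \pi_\theta
\]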
However, while this estimator is unbiased, its variance is large and it is unstable. A trick called the control variates method can reduce the variance. Stating only the conclusion for the Monte Carlo estimation of KL distance with control variates, the following formula gives a well-behaved estimate.
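(This is the estimator Schulman's article calls k3; with r(x) = π_ref(x)/π_θ(x), each term is nonnegative and the estimator remains unbiased.)

\[
D_{\mathrm{KL}}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr)
\approx \frac{1}{N}\sum_{n=1}^{N}\bigl(r(x_n) - 1 - \log r(x_n)\bigr),
\qquad x_n \sim \pi_\theta
\]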

For details, see John Schulman's article "Approximating KL Divergence".
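As a quick numerical sanity check (a toy example of my own, not from the article), the snippet below compares the naive log-ratio estimator with the control-variate (k3) estimator on two small categorical distributions; both means match the true KL, but k3's variance is far smaller:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy categorical distributions standing in for pi_theta and pi_ref
p_theta = np.array([0.5, 0.3, 0.2])     # sampling distribution (current policy)
p_ref   = np.array([0.36, 0.24, 0.40])  # reference distribution (SFT model)

true_kl = np.sum(p_theta * np.log(p_theta / p_ref))

x = rng.choice(3, size=100_000, p=p_theta)  # samples drawn from pi_theta

k1 = np.log(p_theta[x] / p_ref[x])  # naive estimator: log(pi_theta / pi_ref)
r = p_ref[x] / p_theta[x]
k3 = r - 1 - np.log(r)              # control-variate estimator (each term >= 0)

print(f"true KL: {true_kl:.4f}")
print(f"k1: mean={k1.mean():.4f}  std={k1.std():.4f}")
print(f"k3: mean={k3.mean():.4f}  std={k3.std():.4f}")
```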
Summary
- GRPO starts from the property peculiar to LLM reward models that reward occurs only when the answer is completed, and by designing REINFORCE's baseline function b(s) well, it removes PPO's need for a function approximation of the state value V(s)
- Removing the function approximation of V(s) of course reduces compute, and it also improves training stability, which (presumably) led to better performance
Next: GSPO (Group Sequence Policy Optimization)
A method that improves GRPO's training stability, adopted in Qwen3. horomary.hatenablog.com
*1: Note that GRPO first appeared not in DeepSeek-R1 but in DeepSeekMath.