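This post is the Day 19 entry of the docomo Advent Calendar 2024.
Hello! I work in the Cross-Tech Development Department at NTT DOCOMO, where I do data analysis and AI development in the healthcare domain.
In this post I introduce machine learning based on Bayesian inference, together with an analysis example using RStan. This may be familiar ground for people in data science, but I tend to forget the details, so I am writing this partly to organize my own thoughts.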

- 1. Introduction
- 2. About Bayesian Inference
- 3. Trying It Out
- 4. Conclusion
- References
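1. Introduction
In recent years, AI research has advanced rapidly and is now used in every industry. In particular, progress in generative AI has made it easy to generate natural text and high-quality images and video, which has had a large impact on society.
The idea at the core of a generative model is, as the name says, to model the process that generates the data. In other words, we assume a probabilistic structure as the law behind the data, learn the parameters that govern that structure from the observed data, and thereby become able to generate new data.
Describing a data-generating process with probability distributions is called statistical modeling. Modeling phenomena probabilistically has several advantages: it can account for the uncertainty of the real world, it allows flexible models tailored to the problem, and it supports reasonable inference even with relatively small samples. For these reasons it holds an important position in machine learning.
Among these approaches, Bayesian inference, which is based on Bayesian theory, is especially powerful. In addition, many statistical and machine learning methods can be described within the Bayesian framework, so understanding this single framework opens up a wide range of applications, such as flexibly combining different methods.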
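2. About Bayesian Inference
Before moving on to the actual analysis, here is a quick overview of Bayesian inference. (The explanation is not very rigorous; if you want the details, please see the references listed at the end.)
The goal is to assume that the data $\mathcal{D}$ are generated from some probability distribution and to find the parameter $\theta$ that characterizes the shape of that distribution. Once $\theta$ is known, we know the characteristics of the distribution that generates $\mathcal{D}$, so we can generate new data or make predictions for unseen inputs. Bayesian inference is one way to achieve this.
Bayesian inference itself is a very simple procedure for updating probability distributions based on Bayes' theorem. Let us first introduce Bayes' theorem, which is the foundation.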
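Bayes' theorem
Let $p(x, y)$ be the joint distribution of the random variables $x$ and $y$. The marginal distribution of $y$ and the conditional distribution of $x$ given $y$ can then be written as follows.
- Marginal distribution
$$p(y) = \int p(x, y)\,dx \tag{1}$$
Note: when the random variable is discrete, $p(y) = \sum_{x} p(x, y)$.
- Conditional distribution
$$p(x \mid y) = \frac{p(x, y)}{p(y)} \tag{2}$$
From (1) and (2), using the fact that the joint distribution also factorizes as $p(x, y) = p(y \mid x)\,p(x)$, we can derive Bayes' theorem.
- Bayes' theorem
$$p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)} = \frac{p(y \mid x)\,p(x)}{\int p(y \mid x)\,p(x)\,dx} \tag{3}$$
Bayes' theorem is nothing more than a rearrangement of the basic rules of probability, but it has an interesting property. If we interpret $x$ as a cause and $y$ as a result, Bayes' theorem says that, starting from the probability $p(y \mid x)$ that cause $x$ produces result $y$, we can work backwards to the probability $p(x \mid y)$ of cause $x$ given that result $y$ was observed.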
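To make the "working backwards" concrete, here is a minimal sketch in base R with two possible causes. The cause names and all probabilities are hypothetical numbers chosen only for illustration; they do not come from any data in this post.

```r
# Hypothetical prior p(x) over two causes (illustrative numbers only)
prior <- c(cause_A = 0.01, cause_B = 0.99)

# Hypothetical likelihood p(y = observed | x) for each cause
likelihood <- c(cause_A = 0.95, cause_B = 0.05)

# Marginal p(y): sum over x of p(y | x) p(x)  (discrete version of eq. (1))
evidence <- sum(likelihood * prior)

# Bayes' theorem: p(x | y) = p(y | x) p(x) / p(y)
posterior <- likelihood * prior / evidence

print(round(posterior, 4))
```

Even though cause_A explains the observation much better, its small prior keeps its posterior probability well below one half, which is exactly the interplay between prior and likelihood that the theorem expresses.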
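Bayesian inference
Using this property, we can estimate the parameter $\theta$ (the cause) of the probabilistic model that generated the data from the data $\mathcal{D}$ that we actually observed (the result). Applying Bayes' theorem to this relationship gives the following.
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\,p(\theta)}{p(\mathcal{D})} \tag{4}$$
Note: in the Bayesian framework, the parameter $\theta$ to be estimated is treated as a random variable.
Here $p(\mathcal{D} \mid \theta)$ is called the likelihood, $p(\theta)$ the prior distribution, and $p(\theta \mid \mathcal{D})$ the posterior distribution. The likelihood is a function expressing how probable the data $\mathcal{D}$ are for a given parameter $\theta$, and we are free to choose a model for it that suits the problem. The prior is a distribution over $\theta$ that expresses the hypothesis we hold in advance about which values of $\theta$ are plausible. The posterior is also a distribution over $\theta$; because the prior is multiplied by the likelihood, it can be interpreted as the distribution of $\theta$ updated to take the observed data into account. The denominator $p(\mathcal{D})$ is called the marginal likelihood or evidence. It is constant with respect to $\theta$ and acts as the normalizing constant that guarantees the posterior integrates to $1$.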
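In many cases the posterior distribution is difficult to compute analytically. There are three main strategies for obtaining it.
- Conjugate priors: for a given probability distribution used as the likelihood, there are known choices of prior for which the posterior belongs to the same family as the prior. Such a prior is called a conjugate prior (for example, the beta distribution for a Bernoulli likelihood, or the inverse-gamma distribution for the variance of a normal likelihood). With a conjugate prior the posterior can be computed analytically and is very easy to handle. However, it restricts the freedom to design the prior, so its practical applicability is considered limited.
- MCMC (Markov chain Monte Carlo): MCMC generates random numbers that approximately follow the posterior distribution. Even when the posterior cannot be obtained directly, its characteristics can be learned from this simulation. Several MCMC algorithms exist, including the Metropolis-Hastings algorithm, Gibbs sampling, and Hamiltonian Monte Carlo.
- Variational inference: variational inference approximates the target $p(\theta \mid \mathcal{D})$ with a separate, new function $q(\theta; \eta)$. It solves an optimization problem over the parameter $\eta$ of the approximating function, minimizing the KL divergence, a measure of how different two probability distributions are.
$$\eta^{*} = \underset{\eta}{\operatorname{arg\,min}}\ \mathrm{KL}\left[\,q(\theta; \eta)\,\|\,p(\theta \mid \mathcal{D})\,\right] \tag{5}$$
This optimization usually has no analytical solution either, so numerical methods such as gradient-based optimization are used.
With the growth of computing power, approximate inference by MCMC or variational inference has become mainstream in recent years. In Section 3 we try the MCMC approach.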
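The first two strategies can be compared on a toy problem. The following is a minimal sketch in base R, not how Stan samples internally: it assumes simulated Bernoulli data, a Beta(1, 1) prior, and a plain random-walk Metropolis sampler whose proposal width (0.1), chain length, and burn-in are arbitrary illustrative choices.

```r
set.seed(1)

# Simulated Bernoulli data (assumption: true success probability 0.3)
x <- rbinom(50, size = 1, prob = 0.3)
n <- length(x)
k <- sum(x)

# Beta(a0, b0) prior on theta; Beta(1, 1) is uniform on (0, 1)
a0 <- 1
b0 <- 1

# Strategy 1: conjugate update. The posterior is Beta(a0 + k, b0 + n - k).
a_post <- a0 + k
b_post <- b0 + n - k
post_mean_exact <- a_post / (a_post + b_post)

# Strategy 2: random-walk Metropolis on the unnormalized log posterior
log_post <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)  # outside the support
  dbeta(theta, a0, b0, log = TRUE) + sum(dbinom(x, 1, theta, log = TRUE))
}

n_iter <- 20000
draws <- numeric(n_iter)
theta <- 0.5
for (t in seq_len(n_iter)) {
  proposal <- theta + rnorm(1, mean = 0, sd = 0.1)
  # Accept with probability min(1, posterior ratio)
  if (log(runif(1)) < log_post(proposal) - log_post(theta)) {
    theta <- proposal
  }
  draws[t] <- theta
}

# Discard the first 2000 draws as burn-in
post_mean_mcmc <- mean(draws[-(1:2000)])

cat("exact posterior mean:", post_mean_exact, "\n")
cat("MCMC posterior mean: ", post_mean_mcmc, "\n")
```

Note that the sampler never needs the normalizing constant $p(\mathcal{D})$: only the ratio of unnormalized posteriors appears in the acceptance step, which is why MCMC remains usable when the evidence is intractable.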
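Bayesian machine learning
As an example, let us consider how supervised learning can be expressed in Bayesian terms.
Suppose we are given a set of observed input variables $X$ and a set of output variables $Y$, and let $\theta$ be the model parameters. Further, let $y_*$ be the predicted output for a new input $x_*$. The relationships among these random variables are shown in the figure below. A representation like this, which expresses the dependencies among random variables as a graph, is called a graphical model.

The distribution of the predicted output that we want is the conditional distribution $p(y_* \mid x_*, X, Y)$ given the observed data $X, Y$ and the new input $x_*$. The joint distribution factorizes as
$$p(Y, y_*, \theta \mid X, x_*) = p(y_* \mid x_*, \theta)\,p(Y \mid X, \theta)\,p(\theta) \tag{6}$$
so, marginalizing $\theta$ out of (6) and taking the conditional distribution given $Y$, we obtain
$$p(y_* \mid x_*, X, Y) = \int p(y_* \mid x_*, \theta)\,p(\theta \mid X, Y)\,d\theta \tag{7}$$
Here the posterior distribution of the parameters is obtained by Bayesian inference as follows.
$$p(\theta \mid X, Y) = \frac{p(Y \mid X, \theta)\,p(\theta)}{p(Y \mid X)} \tag{8}$$
In summary, supervised machine learning based on Bayesian inference can be interpreted as two steps: a "learning" step, in which the posterior of the parameters $\theta$ is inferred from the observed data $X, Y$ as in (8), and a "prediction" step, in which the predictive distribution is computed from that posterior as in (7).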
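The two steps can be sketched on a case where everything is available in closed form. The example below is an illustration under strong assumptions, not the model used later in this post: a one-parameter regression $y = w x + \varepsilon$ with known noise standard deviation `sigma`, a normal prior on $w$ with standard deviation `tau`, and simulated data. All numbers are made up.

```r
set.seed(2)

sigma <- 1.0   # known noise standard deviation (assumption)
tau   <- 10.0  # prior standard deviation of w, prior mean 0 (assumption)

# Simulated training data with true slope 1.5
x <- runif(30, -2, 2)
y <- 1.5 * x + rnorm(30, 0, sigma)

# Learning step: the posterior of w is normal (conjugate normal prior)
post_prec <- 1 / tau^2 + sum(x^2) / sigma^2
post_var  <- 1 / post_prec
post_mean <- post_var * sum(x * y) / sigma^2

# Prediction step: integrate over the posterior by Monte Carlo.
# Draw w from the posterior, then draw y_new given each w.
x_new <- 1.2
w_draws     <- rnorm(1e5, post_mean, sqrt(post_var))
y_new_draws <- rnorm(1e5, w_draws * x_new, sigma)

# Closed-form predictive distribution for comparison
pred_mean_exact <- post_mean * x_new
pred_var_exact  <- sigma^2 + x_new^2 * post_var

cat("predictive mean (MC / exact):", mean(y_new_draws), "/", pred_mean_exact, "\n")
cat("predictive var  (MC / exact):", var(y_new_draws), "/", pred_var_exact, "\n")
```

The predictive variance is the noise variance plus a term coming from the remaining uncertainty in $w$; a plug-in point estimate of $w$ would drop that second term.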
3. Trying It Out
Preparation
Let us actually run an example of Bayesian machine learning. We use the Medical Cost Personal Datasets published on Kaggle. This dataset contains basic data on medical insurance policyholders in the United States: age, sex, BMI, number of children, smoking status, region of residence, and medical charges. With it we consider a simple problem: predicting medical charges from a policyholder's basic information.
To implement the statistical models we use a paradigm called probabilistic programming, in which a model involving probability distributions is written as a program and inference or data generation is performed based on that model. There are various probabilistic programming languages, such as Stan, PyMC, and Edward; here we use Stan, which is flexible and easy to work with. Models written in Stan can be called from R or Python. This example uses R.
(Reference: RStan Getting Started (Japanese))
Exploratory data analysis
First we load the dataset in R and check the characteristics of the data.
# Load the basic libraries
library(rstan)
library(ggplot2)
library(dplyr)

# Load the data
df <- read.csv(file = '../data/insurance.csv') %>% as.data.frame()
# Remove duplicated records
df <- df[!duplicated(df), ]
# Show the first records
head(df)
# Check the summary statistics
summary(df)

# Check the distribution of charges (medical charges)
# after_stat(density) and linewidth assume ggplot2 >= 3.4.0
# (they replace the deprecated ..density.. and size for lines)
plot_charges <- ggplot(df, aes(x = charges)) +
  geom_histogram(bins = 40, aes(y = after_stat(density)), fill = "#C59A5A", color = "black", alpha = 0.7) +
  geom_density(color = "blue", linewidth = 0.7) +
  labs(x = "Charges", y = "Density", title = "Charges Distribution")
plot_charges

library(gridExtra)

# Check the distributions of the numeric columns with histograms
# age
plot_age <- ggplot(df, aes(x = age)) +
  geom_histogram(binwidth = 1, aes(y = after_stat(density)), fill = "#C55A71", color = "black", alpha = 0.7) +
  labs(x = "Age", y = "Density", title = "Age Distribution")
# bmi
plot_bmi <- ggplot(df, aes(x = bmi)) +
  geom_histogram(bins = 30, aes(y = after_stat(density)), fill = "#5A9EC5", color = "black", alpha = 0.7) +
  geom_density(color = "blue", linewidth = 0.7) +
  labs(x = "BMI", y = "Density", title = "BMI Distribution")
# children
plot_children <- ggplot(df, aes(x = children)) +
  geom_histogram(binwidth = 1, aes(y = after_stat(density)), fill = "#5AC573", color = "black", alpha = 0.7) +
  labs(x = "Children", y = "Density", title = "Children Distribution")
# Scatter plot of age vs charges
plot_age_charges <- ggplot(df, aes(x = age, y = charges)) +
  geom_point(color = "#C55A71", alpha = 0.7) +
  labs(x = "Age", y = "Charges", title = "Age vs Charges")
# Scatter plot of bmi vs charges
plot_bmi_charges <- ggplot(df, aes(x = bmi, y = charges)) +
  geom_point(color = "#5A9EC5", alpha = 0.7) +
  labs(x = "BMI", y = "Charges", title = "BMI vs Charges")
# Scatter plot of children vs charges
plot_children_charges <- ggplot(df, aes(x = children, y = charges)) +
  geom_point(color = "#5AC573", alpha = 0.7) +
  labs(x = "Children", y = "Charges", title = "Children vs Charges")
# Show the plots together
grid.arrange(plot_age, plot_bmi, plot_children, plot_age_charges, plot_bmi_charges, plot_children_charges, ncol = 3)

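- Age: there are many records at age 20 and below; charges rise with age, and the points appear to split into three distinct bands.
- BMI: roughly normally distributed with a mean near 30; there appear to be two patterns, one where charges rise with BMI and one where charges stay flat regardless of BMI.
- Children: charges decrease as the number of children increases, but records with four or more children make up only a few percent of the data, so this trend is not reliable.
Next, for the categorical variables (sex, smoking status, region of residence), we check their distributions and their relationship with the target variable.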
# Check the distributions of the categorical variables with pie charts
create_pie_chart <- function(data, column, title) {
  data %>%
    count(!!sym(column)) %>%
    mutate(percentage = n / sum(n) * 100) %>%
    ggplot(aes(x = "", y = n, fill = !!sym(column))) +
    geom_bar(stat = "identity", width = 1, color = "white") +
    coord_polar(theta = "y") +
    labs(fill = column, title = title, y = "", x = "") +
    geom_text(aes(label = paste0(round(percentage, 1), "%")),
              position = position_stack(vjust = 0.5), size = 4) +
    theme_minimal() +
    theme(axis.text = element_blank(),
          axis.ticks = element_blank(),
          panel.grid = element_blank())
}
# sex
plot_sex <- create_pie_chart(df, "sex", "Sex Distribution")
# smoker
plot_smoker <- create_pie_chart(df, "smoker", "Smoker Distribution")
# region
plot_region <- create_pie_chart(df, "region", "Region Distribution")
# Box plot of sex vs charges
plot_sex_charges <- ggplot(df, aes(x = sex, y = charges, fill = sex)) +
  geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.shape = 16) +
  labs(x = "Sex", y = "Charges", title = "Sex vs Charges") +
  theme_minimal() +
  theme(legend.position = "none")
# Box plot of smoker vs charges
plot_smoker_charges <- ggplot(df, aes(x = smoker, y = charges, fill = smoker)) +
  geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.shape = 16) +
  labs(x = "Smoker", y = "Charges", title = "Smoker vs Charges") +
  theme_minimal() +
  theme(legend.position = "none")
# Box plot of region vs charges
plot_region_charges <- ggplot(df, aes(x = region, y = charges, fill = region)) +
  geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.shape = 16) +
  labs(x = "Region", y = "Charges", title = "Region vs Charges") +
  theme_minimal() +
  theme(legend.position = "none")
# Show the plots together
grid.arrange(plot_sex, plot_smoker, plot_region, plot_sex_charges, plot_smoker_charges, plot_region_charges, ncol = 3)

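- Sex: the split between men and women is about half and half, and there is no large difference in charges.
- Smoking status: smokers are a minority, and their charges are far higher than those of non-smokers.
- Region: the regions have roughly equal shares, and there is no large difference in charges.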
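Bayesian linear regression
We have now checked the basic relationship between each variable and the medical charges. Here we fit a simple linear regression model. In a linear regression model, an $M$-dimensional input vector $\mathbf{x}$ is mapped by arbitrary basis functions $\boldsymbol{\phi}$ into an $H$-dimensional feature space, and the output $y$ is expressed as the linear combination of $\boldsymbol{\phi}(\mathbf{x})$ with a weight vector $\mathbf{w}$ plus a noise term $\varepsilon$ that follows a Gaussian distribution with mean $0$ and variance $\sigma^2$.
$$y = \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2) \tag{9}$$
From (9), $y$ is a random variable that follows a Gaussian distribution with mean $\mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x})$ and variance $\sigma^2$. Therefore the conditional distribution of $y$ given $\mathbf{x}$ and $\mathbf{w}$ (and $\sigma$) can be written as follows.
$$p(y \mid \mathbf{x}, \mathbf{w}, \sigma) = \mathcal{N}\left(y \mid \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}),\ \sigma^2\right) \tag{10}$$
The parameters we want to infer (learn) are $\mathbf{w}$ and $\sigma$, so we set priors $p(\mathbf{w})$, $p(\sigma)$ and use Bayesian inference to obtain the posterior $p(\mathbf{w}, \sigma \mid X, Y)$ after observing the training data $X, Y$.
Using (10), the posterior is given by
$$p(\mathbf{w}, \sigma \mid X, Y) = \frac{p(\mathbf{w})\,p(\sigma)\prod_{n=1}^{N} p(y_n \mid \mathbf{x}_n, \mathbf{w}, \sigma)}{p(Y \mid X)} \tag{11}$$
Choosing the basis functions as $\boldsymbol{\phi}(\mathbf{x}) = (1, x_1, \dots, x_M)^{\top}$ gives the model generally called multiple regression. In this problem the explanatory variables are $\mathbf{x} = (\mathrm{age}, \mathrm{sex}, \mathrm{bmi}, \mathrm{children}, \mathrm{smoker}, \mathrm{region})^{\top}$ and the weight vector is $\mathbf{w} = (\alpha, \beta_{\mathrm{age}}, \beta_{\mathrm{sex}}, \beta_{\mathrm{bmi}}, \beta_{\mathrm{children}}, \beta_{\mathrm{smoker}}, \beta_{\mathrm{region}})^{\top}$.
For the priors on $\mathbf{w}$ and $\sigma$, we have no particular evidence or information available in advance this time. In such cases a uniform distribution with a sufficiently wide range is often used as the prior. Such a distribution is called a noninformative prior*1.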
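One consequence of a flat prior is worth checking numerically: the posterior is then proportional to the likelihood, so the posterior mode of the weights coincides with the ordinary least-squares estimate. The sketch below uses simulated data with made-up coefficients (it does not read insurance.csv) and compares the normal-equations solution with `lm()`.

```r
set.seed(3)

# Simulated data with hypothetical coefficients, for illustration only
n   <- 200
age <- runif(n, 18, 64)
bmi <- rnorm(n, 30, 6)
y   <- 1000 + 250 * age + 300 * bmi + rnorm(n, 0, 5000)

# Basis functions phi(x) = (1, x_1, ..., x_M): a column of ones plus the inputs
Phi <- cbind(1, age, bmi)

# Maximizing the Gaussian likelihood in w is least squares:
# w_hat solves (Phi' Phi) w = Phi' y
w_hat <- solve(crossprod(Phi), crossprod(Phi, y))

# The same fit through R's built-in linear model
fit_lm <- lm(y ~ age + bmi)

print(cbind(normal_equations = as.vector(w_hat), lm = unname(coef(fit_lm))))
```

What the Bayesian treatment adds on top of this point estimate is the full posterior over the weights and $\sigma$, and therefore a predictive distribution rather than a single fitted value.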
以ä¸ã®ã¢ãã«ãStanã§è¨è¿°ããã¨ä»¥ä¸ã®ããã«ãªãã¾ãã
// dataãããã¯ã§ã¯ã¢ãã«ã«ä¸ããæ¢ç¥ã®è¦³æ¸¬ãã¼ã¿ãåºå®å¤ãå®ç¾©ãã¾ãã
// ä»åã¯ãã¹ããã¼ã¿ã«å¯¾ããäºæ¸¬åå¸ã¾ã§æ±ãããããã¹ããã¼ã¿ç¨ã®å¤æ°ãä½µãã¦å®ç¾©ãã¦ãã¾ãã
data {
int<lower=0> N; // è¨ç·´ãã¼ã¿ã®ãµã³ãã«æ°
vector[N] age; // ageå
vector[N] sex; // sexå
vector[N] bmi; // bmiå
vector[N] children; // childrenå
vector[N] smoker; // smokerå
vector[N] region; // regionå
vector[N] y; // ç®ç夿° (charges)
int<lower=0> N_test; // ãã¹ããã¼ã¿ã®ãµã³ãã«æ°
vector[N_test] age_test; // ãã¹ããã¼ã¿ã®ageå
vector[N_test] sex_test; // ãã¹ããã¼ã¿ã®sexå
vector[N_test] bmi_test; // ãã¹ããã¼ã¿ã®bmiå
vector[N_test] children_test; // ãã¹ããã¼ã¿ã®childrenå
vector[N_test] smoker_test; // ãã¹ããã¼ã¿ã®smokerå
vector[N_test] region_test; // ãã¹ããã¼ã¿ã®regionå
}
// parametersãããã¯ã§ã¯æ¨å®ãã¹ãæªç¥ã®ãã©ã¡ã¼ã¿ãå®ç¾©ãã¾ãã
parameters {
real alpha; // åç
real beta_age; // ageã®ä¿æ°
real beta_sex; // sexã®ä¿æ°
real beta_bmi; // bmiã®ä¿æ°
real beta_children; // childrenã®ä¿æ°
real beta_smoker; // smokerã®ä¿æ°
real beta_region; // regionã®ä¿æ°
real<lower=0> sigma; // æ®å·®ã®æ¨æºåå·®
}
// modelãããã¯ã§ã¯ãã©ã¡ã¼ã¿ã®äºååå¸ãã¢ãã«æ§é ãå®ç¾©ãã¾ãã
model {
// äºååå¸ãè¨å®ããå ´åã¯ããã«è¨è¼ãã¾ãã
// e.g. alpha ~ normal(0, 100);
// çç¥ããå ´åã¯ç¡æ
å ±äºååå¸ã¨ãã¦ãååã«å¹
ã®åºã䏿§åå¸ãè¨å®ããã¾ãã
// 尤度
y ~ normal(
alpha +
beta_age * age +
beta_sex * sex +
beta_bmi * bmi +
beta_children * children +
beta_smoker * smoker +
beta_region * region,
sigma
);
}
// generated quantitiesãããã¯ã¯æ¨å®çµæããæ´¾çããå¤ãäºæ¸¬å¤ãçæãã¾ãã
// ä»åã¯ãã¹ããã¼ã¿ã«å¯¾ããäºæ¸¬åå¸ã¾ã§æ±ãããã以ä¸ã§å®ç¾©ãã¦ãã¾ãã
generated quantities {
vector[N_test] y_test_pred;
for (i in 1:N_test) {
y_test_pred[i] = normal_rng(
alpha +
beta_age * age_test[i] +
beta_sex * sex_test[i] +
beta_bmi * bmi_test[i] +
beta_children * children_test[i] +
beta_smoker * smoker_test[i] +
beta_region * region_test[i],
sigma
);
}
}
Save the above as model1.stan, then load it from R and use it.
library(caret)  # provides createDataPartition(), which the code below relies on

# Label encoding
# Note: region becomes an integer code (1-4) that the model treats as a numeric variable.
df <- df %>%
  mutate(
    sex = as.numeric(factor(sex, levels = unique(sex))) - 1,
    smoker = as.numeric(smoker == "yes"),  # 1: smoker, 0: non-smoker
    region = as.numeric(factor(region, levels = unique(region)))
  )
# Split the data into training and test sets
set.seed(123)  # arbitrary seed, added so that the split is reproducible
trainIndex <- createDataPartition(df$charges, p = 0.8, list = FALSE)
train_data <- df[trainIndex, ]
test_data <- df[-trainIndex, ]
# Build the data list passed to Stan
stan_data <- list(
  N = nrow(train_data),
  age = train_data$age,
  sex = train_data$sex,
  bmi = train_data$bmi,
  children = train_data$children,
  smoker = train_data$smoker,
  region = train_data$region,
  y = train_data$charges,
  N_test = nrow(test_data),
  age_test = test_data$age,
  sex_test = test_data$sex,
  bmi_test = test_data$bmi,
  children_test = test_data$children,
  smoker_test = test_data$smoker,
  region_test = test_data$region
)
# Run the Stan model
# By default Stan uses an MCMC method called the "No-U-Turn Sampler (NUTS)".
fit <- stan(
  file = "model1.stan",
  data = stan_data,
  iter = 4000,
  chains = 4
)
fit stores the MCMC sampling results for all parameters. The results can be checked as follows.
# Check the sampling results (excerpt only)
print(fit, pars = c("alpha", "beta_age", "beta_sex", "sigma", "y_test_pred[1]", "y_test_pred[2]"))

次ã«ãã¹ããã¼ã¿ã«å¯¾ããäºæ¸¬å¤ã確èªãã¦ã¿ã¾ããy_test_pred[i]ã«äºæ¸¬åå¸ãæ ¼ç´ããã¦ãã¾ãããããã¯1ç¹ã«å®ã¾ãå¤ã§ã¯ãªãã®ã§ãä»åã¯ãµã³ãã«å¹³åã代表å¤ã¨ãã¦ç¢ºèªãã¦ã¿ã¾ãã
# ãµã³ãã«ã®æ½åº
y_test_pred_samples <- rstan::extract(fit, pars = "y_test_pred")$y_test_pred
# ãµã³ãã«å¹³åå¤ãäºæ¸¬å¤ã¨ãã
y_test_pred_mean <- colMeans(y_test_pred_samples)
comparison <- data.frame(
Actual = test_data$charges,
Predicted = y_test_pred_mean,
Smoker = factor(test_data$smoker)
)
axis_limit <- range(
c(comparison$Actual, comparison$Predicted),
na.rm = TRUE,
finite = TRUE
)
# Scatter plot of actual vs predicted values
ggplot(comparison, aes(x = Actual, y = Predicted, color = Smoker)) +
geom_point(alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Actual vs Predicted Charges by Smoker Status",
x = "Actual Charges",
y = "Predicted Charges",
color = "Smoker"
) +
theme_minimal() +
coord_fixed(ratio = 1) +
scale_x_continuous(limits = axis_limit) +
scale_y_continuous(limits = axis_limit)
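Because the predictive distribution is available as draws, the sample mean is only one possible summary; interval summaries come almost for free. The sketch below is self-contained and therefore uses a simulated matrix as a stand-in for the matrix returned by `rstan::extract` (one row per posterior draw, one column per test record). The five means and the standard deviation of 6000 are made-up values, not results from the model.

```r
set.seed(4)

# Stand-in for rstan::extract(fit, pars = "y_test_pred")$y_test_pred
# (simulated draws: 4000 rows = draws, 5 columns = hypothetical test records)
n_draws <- 4000
hypothetical_means <- c(3000, 8000, 12000, 20000, 35000)
draws <- sapply(hypothetical_means, function(m) rnorm(n_draws, m, 6000))

# Per-record summaries: mean and a central 95% predictive interval
pred_summary <- data.frame(
  mean  = colMeans(draws),
  lower = apply(draws, 2, quantile, probs = 0.025),
  upper = apply(draws, 2, quantile, probs = 0.975)
)

print(round(pred_summary))
```

With the real fit, replacing `draws` by `y_test_pred_samples` gives one interval per test record, which can be compared against `test_data$charges` to see how often the actual value falls inside.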

Hierarchical model
Next, let us look at the trends separately for non-smokers and smokers.
# smoker was encoded as 0/1 above, so it is converted to a factor for discrete colors
# Scatter plot of bmi vs charges (by smoker)
plot_bmi_smoker <- ggplot(df, aes(x = bmi, y = charges, color = factor(smoker))) +
  geom_point(alpha = 0.7) +
  labs(title = "BMI vs Charges",
       x = "BMI",
       y = "Charges") +
  guides(colour = "none") +
  theme_minimal()
# Scatter plot of age vs charges (by smoker)
plot_age_smoker <- ggplot(df, aes(x = age, y = charges, color = factor(smoker))) +
  geom_point(alpha = 0.7) +
  labs(title = "Age vs Charges",
       x = "Age",
       y = "Charges") +
  guides(colour = "none") +
  theme_minimal()
# Show the plots together
grid.arrange(plot_bmi_smoker, plot_age_smoker, ncol = 2)

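- Model for smokers
$$y_i \sim \mathcal{N}\left(\alpha_{\mathrm{s}} + \beta_{\mathrm{age,s}}\,\mathrm{age}_i + \beta_{\mathrm{bmi,s}}\,\mathrm{bmi}_i + \beta_{\mathrm{smoker}}\,\mathrm{smoker}_i,\ \sigma_{\mathrm{s}}^2\right)$$
- Model for non-smokers
$$y_i \sim \mathcal{N}\left(\alpha_{\mathrm{ns}} + \beta_{\mathrm{age,ns}}\,\mathrm{age}_i + \beta_{\mathrm{bmi,ns}}\,\mathrm{bmi}_i + \beta_{\mathrm{smoker}}\,\mathrm{smoker}_i,\ \sigma_{\mathrm{ns}}^2\right)$$
where $y_i$ is the log-transformed charges and
$$\alpha_{\mathrm{s}}, \alpha_{\mathrm{ns}} \sim \mathcal{N}(\mu_{\alpha}, \tau_{\alpha}^2), \quad \beta_{\mathrm{age,s}}, \beta_{\mathrm{age,ns}} \sim \mathcal{N}(\mu_{\beta_{\mathrm{age}}}, \tau_{\beta_{\mathrm{age}}}^2), \quad \beta_{\mathrm{bmi,s}}, \beta_{\mathrm{bmi,ns}} \sim \mathcal{N}(\mu_{\beta_{\mathrm{bmi}}}, \tau_{\beta_{\mathrm{bmi}}}^2), \quad \beta_{\mathrm{smoker}} \sim \mathcal{N}(\mu_{\beta_{\mathrm{smoker}}}, \tau_{\beta_{\mathrm{smoker}}}^2)$$
In this model the parameters differ depending on smoking status, but they are constrained to be generated from the same higher-level distribution (so they tend to be similar). A model like this is called a hierarchical model. Written in Stan, the model looks like this: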
// Note: the "int ... smoker[N]" array declarations follow the original listing;
// Stan 2.33 and later require the form "array[N] int<lower=0,upper=1> smoker;".
data {
  int<lower=0> N;                              // number of training samples
  vector[N] age;                               // age
  vector[N] bmi;                               // BMI
  int<lower=0,upper=1> smoker[N];              // smoking status (0: non-smoker, 1: smoker)
  vector[N] y;                                 // medical charges (log-transformed)
  int<lower=0> N_test;                         // number of test samples
  vector[N_test] age_test;                     // age of the test data
  vector[N_test] bmi_test;                     // BMI of the test data
  int<lower=0,upper=1> smoker_test[N_test];    // smoking status of the test data
}
parameters {
  // Parameters of the higher-level distributions
  real mu_alpha;                    // mean of the intercepts
  real<lower=0> tau_alpha;          // standard deviation of the intercepts
  real mu_beta_age;                 // mean of the age coefficients
  real<lower=0> tau_beta_age;       // standard deviation of the age coefficients
  real mu_beta_bmi;                 // mean of the BMI coefficients
  real<lower=0> tau_beta_bmi;       // standard deviation of the BMI coefficients
  real mu_beta_smoker;              // mean of the smoking effect
  real<lower=0> tau_beta_smoker;    // standard deviation of the smoking effect
  // Parameters for each smoking group
  real alpha_smoker;                // intercept for smokers
  real alpha_non_smoker;            // intercept for non-smokers
  real beta_age_smoker;             // age coefficient for smokers
  real beta_age_non_smoker;         // age coefficient for non-smokers
  real beta_bmi_smoker;             // BMI coefficient for smokers
  real beta_bmi_non_smoker;         // BMI coefficient for non-smokers
  real beta_smoker;                 // regression coefficient of the smoking effect
                                    // (contributes only when smoker[i] == 1)
  // Standard deviations
  real<lower=0> sigma_smoker;
  real<lower=0> sigma_non_smoker;
}
model {
  // Priors on the higher-level distributions
  mu_alpha ~ normal(0, 10);
  tau_alpha ~ cauchy(0, 2);
  mu_beta_age ~ normal(0, 1);
  tau_beta_age ~ cauchy(0, 2);
  mu_beta_bmi ~ normal(0, 1);
  tau_beta_bmi ~ cauchy(0, 2);
  mu_beta_smoker ~ normal(0, 1);
  tau_beta_smoker ~ cauchy(0, 2);
  // Priors on the per-group parameters (hierarchical structure)
  alpha_smoker ~ normal(mu_alpha, tau_alpha);
  alpha_non_smoker ~ normal(mu_alpha, tau_alpha);
  beta_age_smoker ~ normal(mu_beta_age, tau_beta_age);
  beta_age_non_smoker ~ normal(mu_beta_age, tau_beta_age);
  beta_bmi_smoker ~ normal(mu_beta_bmi, tau_beta_bmi);
  beta_bmi_non_smoker ~ normal(mu_beta_bmi, tau_beta_bmi);
  beta_smoker ~ normal(mu_beta_smoker, tau_beta_smoker);
  // Priors on the standard deviations (half-normal because of lower=0)
  sigma_smoker ~ normal(0, 1);
  sigma_non_smoker ~ normal(0, 1);
  // Likelihood
  for (i in 1:N) {
    if (smoker[i] == 1) {
      y[i] ~ normal(
        alpha_smoker + beta_age_smoker * age[i] + beta_bmi_smoker * bmi[i] + beta_smoker * smoker[i],
        sigma_smoker
      );
    } else {
      y[i] ~ normal(
        alpha_non_smoker + beta_age_non_smoker * age[i] + beta_bmi_non_smoker * bmi[i] + beta_smoker * smoker[i],
        sigma_non_smoker
      );
    }
  }
}
generated quantities {
  // Predictions are transformed back to the original scale with exp()
  vector[N_test] y_test_pred;
  for (i in 1:N_test) {
    if (smoker_test[i] == 1) {
      y_test_pred[i] = exp(normal_rng(
        alpha_smoker + beta_age_smoker * age_test[i] + beta_bmi_smoker * bmi_test[i] + beta_smoker * smoker_test[i],
        sigma_smoker
      ));
    } else {
      y_test_pred[i] = exp(normal_rng(
        alpha_non_smoker + beta_age_non_smoker * age_test[i] + beta_bmi_non_smoker * bmi_test[i] + beta_smoker * smoker_test[i],
        sigma_non_smoker
      ));
    }
  }
}
Save the above as model2.stan and run the inference by calling it from R.
# Log-transform charges
df <- df %>%
  mutate(
    log_charges = log(charges)
  )
# Split the data into training and test sets
trainIndex <- createDataPartition(df$log_charges, p = 0.8, list = FALSE)
train_data <- df[trainIndex, ]
test_data <- df[-trainIndex, ]
# Compute the mean and standard deviation on the training data
age_mean <- mean(train_data$age)
age_sd <- sd(train_data$age)
bmi_mean <- mean(train_data$bmi)
bmi_sd <- sd(train_data$bmi)
# Standardize the training data
train_data <- train_data %>%
  mutate(
    age = (age - age_mean) / age_sd,
    bmi = (bmi - bmi_mean) / bmi_sd
  )
# Standardize the test data (using the training mean and standard deviation)
test_data <- test_data %>%
  mutate(
    age = (age - age_mean) / age_sd,
    bmi = (bmi - bmi_mean) / bmi_sd
  )
# Prepare the data for Stan (passed as individual vectors)
# smoker is declared as an int array in model2.stan, so it is passed as integer
stan_data <- list(
  N = nrow(train_data),
  age = train_data$age,
  bmi = train_data$bmi,
  smoker = as.integer(train_data$smoker),
  y = train_data$log_charges,
  N_test = nrow(test_data),
  age_test = test_data$age,
  bmi_test = test_data$bmi,
  smoker_test = as.integer(test_data$smoker)
)
# Run the model
fit <- stan(
  file = "model2.stan",
  data = stan_data,
  iter = 4000,
  chains = 4
)
# Check the sampling results (excerpt only)
print(fit, pars = c("alpha_smoker", "alpha_non_smoker", "beta_bmi_smoker", "beta_bmi_non_smoker"))

# Extract the samples
y_test_pred_samples <- rstan::extract(fit, pars = "y_test_pred")$y_test_pred
y_test_pred_mean <- colMeans(y_test_pred_samples)
comparison <- data.frame(
Actual = test_data$charges,
Predicted = y_test_pred_mean,
Smoker = factor(test_data$smoker)
)
axis_limit <- range(
c(comparison$Actual, comparison$Predicted),
na.rm = TRUE,
finite = TRUE
)
ggplot(comparison, aes(x = Actual, y = Predicted, color = Smoker)) +
geom_point(alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Actual vs Predicted Charges",
x = "Actual Charges",
y = "Predicted Charges",
color = "Smoker"
) +
theme_minimal() +
coord_fixed(ratio = 1) +
scale_x_continuous(limits = axis_limit) +
scale_y_continuous(limits = axis_limit)
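A note on the back-transformation used with model2.stan: each `y_test_pred` draw is `exp()` of a draw on the log scale, so averaging the draws estimates the mean of a log-normal distribution, $\exp(\mu + \sigma^2/2)$, which is larger than $\exp(\mu)$ (the median). The following sketch illustrates the gap with simulated log-scale draws; the values `mu = 9` and `sigma = 0.5` are arbitrary assumptions, not estimates from the model.

```r
set.seed(5)

# Hypothetical log-scale predictive distribution (illustrative values)
mu    <- 9
sigma <- 0.5
log_draws <- rnorm(1e5, mu, sigma)

# Averaging after exp(): estimates the log-normal mean exp(mu + sigma^2 / 2)
mean_of_exp <- mean(exp(log_draws))

# exp() of the log-scale average: estimates exp(mu), the log-normal median
exp_of_mean <- exp(mean(log_draws))

cat("mean of exp(draws):", mean_of_exp, " theory:", exp(mu + sigma^2 / 2), "\n")
cat("exp of mean(draws):", exp_of_mean, " theory:", exp(mu), "\n")
```

Which summary is appropriate depends on the goal: the mean targets expected charges, while the median is less affected by the long right tail that the charges histogram shows.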

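4. Conclusion
In this post I introduced Bayesian machine learning using Stan. The problem considered here was a simple one, and I expect the results do not differ much from those of an ordinary linear regression model. However, the Bayesian approach lets you build prior knowledge and uncertainty into the model, and it has many attractions, such as high interpretability, robustness on small datasets, and sequential learning. Even if you usually just throw the data into a boosting model and call it done, I think that looking at the data and building and running a model in Stan or a similar tool will give you a feel for how interesting probabilistic modeling is, so please give it a try if you are interested!
Thank you for reading.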
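References
- Atsushi Suyama. ベイズ推論による機械学習入門 [Introduction to Machine Learning by Bayesian Inference]. Kodansha, 2017.
- Atsushi Suyama. ベイズ深層学習 [Bayesian Deep Learning]. Kodansha, 2019.
- Kentaro Matsuura. StanとRでベイズ統計モデリング [Bayesian Statistical Modeling with Stan and R]. Kyoritsu Shuppan, 2017.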

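*1: As a supplement, in recent years it has been considered better to use weakly informative priors, which carry a minimal amount of information, rather than noninformative priors. For example, standardized partial regression coefficients usually fall roughly within -1 to 1, so a weakly informative t-distribution prior with mean 0, a scale of about 1 to 2, and about 3 to 7 degrees of freedom is recommended. The reason for choosing 3 to 7 degrees of freedom is that the fat tails are reasonably heavy, which provides robustness. (On the degrees of freedom: a t distribution with 1 degree of freedom, that is, the Cauchy distribution, has tails so heavy that, although robust, it is close to a noninformative prior and is therefore not recommended. With 8 or more degrees of freedom, the t distribution approaches the normal distribution and becomes short-tailed, so robustness can no longer be ensured and it is likewise not recommended. That said, a normal prior acts in the same way as an L2-norm (Ridge) penalty, so it is also important to choose between these priors appropriately.)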