formulaとは？（1） - はやしのブログ Rev.3

某所でRのformulaオブジェクトの扱い方について話題になっていて，自分自身あまり詳しくはなかったので，まとめておく。

formulaはRに特有のクラスで，使いこなせば分析を非常に効率化できるが，つっこんだ情報はあまり見かけない。formulaは基本的には回帰系の関数の引数に用いられる，モデルを表現するオブジェクトである。

fm1 <- y ~ x1 + x2 + x3
class(fm1)
#=> [1] "formula"
update(fm1, . ~ . + x4)  # updateで書き換えられる
#=> y ~ x1 + x2 + x3 + x4
update(fm1, . ~ . -x1)
#=> y ~ x2 + x3 + x4
update(fm1, y2 ~ (.)^2)
#=> y2 ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 # x1:x2等は交互作用項

fm1は下の回帰式と対応する。

$\hat{y} = b_0 + b_1x_1 + b_2x_2 + b_3x_3$

重要なのは，fm1を作成する際に，環境中にy，x1，x2，x3という変数が存在する必要はない，ということだ。formulaはモデルの構造を表現しているにすぎず，実在の変数とのつながりのない抽象的なオブジェクトである。formulaを実在の変数と結びつけるには，model.frame()を使う。

y <- 1:3; x1 <- 4:6; x2 <- 7:9; x3 <- 10:12
(d1 <- model.frame(fm1))  # デフォルトでは大局的環境から
#=>   y x1 x2 x3          # 変数を取得し，データフレームを返す
#=> 1 1  4  7 10
#=> 2 2  5  8 11
#=> 3 3  6  9 12
d2 <- d1 * 2
model.frame(fm2, data=d2)  # data=...で変数を取得する
#=>   y x1 x2 x3           # データフレームを指定
#=> 1 2  8 14 20
#=> 2 4 10 16 22
#=> 3 6 12 18 24

formulaからモデルの構造の情報を取り出すには，terms()を使う。

(tm1 <- terms(fm1))      # termsオブジェクトを返す
#=> y ~ x1 + x2 + x3     # termsオブジェクトの詳細はhelp(terms.object)
#=> attr(,"variables")
#=> list(y, x1, x2, x3)
#=> attr(,"factors")
#=>    x1 x2 x3
#=> y   0  0  0
#=> x1  1  0  0
#=> x2  0  1  0
#=> x3  0  0  1
#=> attr(,"term.labels")
#=> [1] "x1" "x2" "x3"
#=> attr(,"order")
#=> [1] 1 1 1
#=> attr(,"intercept")
#=> [1] 1
#=> attr(,"response")
#=> [1] 1
#=> attr(,".Environment")

ちなみに，model.frame()で取り出したデータフレームには,属性としてtermsが格納されている。

str(d1)
#=> 'data.frame':   3 obs. of  4 variables:
#=>  $ y : int  1 2 3
#=>  $ x1: int  4 5 6
#=>  $ x2: int  7 8 9
#=>  $ x3: int  10 11 12
#=>  - attr(*, "terms")=Classes 'terms', 'formula' length 3 y ~ x1 + x2 + x3
#=>   .. ..- attr(*, "variables")= language list(y, x1, x2, x3)
#=>   .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
#=>   .. .. ..- attr(*, "dimnames")=List of 2
#=>   .. .. .. ..$ : chr [1:4] "y" "x1" "x2" "x3"
#=>   .. .. .. ..$ : chr [1:3] "x1" "x2" "x3"
#=>   .. ..- attr(*, "term.labels")= chr [1:3] "x1" "x2" "x3"
#=>   .. ..- attr(*, "order")= int [1:3] 1 1 1
#=>   .. ..- attr(*, "intercept")= int 1
#=>   .. ..- attr(*, "response")= int 1
#=>   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#=>   .. ..- attr(*, "predvars")= language list(y, x1, x2, x3)
#=>   .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
#=>   .. .. ..- attr(*, "names")= chr [1:4] "y" "x1" "x2" "x3"
class(attr(d1,"terms"))
#=> [1] "terms"   "formula"

attr()でtermsオブジェクトを取り出し，さらにtermsオブジェクトからattr()で情報を取り出すこともできるが，model.frame用の便利な関数が用意されている。

model.extract(d1,"response")   # responseは，formulaでチルダ（~）の左にある変数
#=> 1 2 3      # 行の名前
#=> 1 2 3      # 内容
model.response(d1)  # ほぼ同じ。詳細はhelp(model.exract)
#=> 1 2 3 
#=> 1 2 3