質的変数と回帰分析

質的変数と回帰分析#

in English or the language of your choice.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import wooldridge

# 警告メッセージを非表示
import warnings
warnings.filterwarnings("ignore")

実証分析で使う変数を次の２種類に区別することができる。

量的変数（体重，賃金など）
質的変数（性別，人種，地域など）

今までは量的変数を考えたが，ここでは質的変数について議論する。まず男女のような２つの属性に分けるダミー変数を考える。その後に，より一般的な3つ以上の属性をに特徴付けられるカテゴリー変数を扱う場合を説明する。

ダミー変数#

ケース１：定数項だけの回帰分析#

ダミー変数なし#

直感的な説明にするために説明変数が定数項だけの回帰分析から始める。具体的には次の回帰式を考える。

\(y=\beta_0+u\)

実は，この場合のOLS推定量\(\hat{\beta}_0\)は被説明変数\(y\)の平均と等しいことになる。この結果を確認するために以下ではwooldridgeパッケージのwage1のデータを使う。

wage1 = wooldridge.data('wage1')
wooldridge.data('wage1',description=True)

name of dataset: wage1
no of variables: 24
no of observations: 526

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| wage     | average hourly earnings         |
| educ     | years of education              |
| exper    | years potential experience      |
| tenure   | years with current employer     |
| nonwhite | =1 if nonwhite                  |
| female   | =1 if female                    |
| married  | =1 if married                   |
| numdep   | number of dependents            |
| smsa     | =1 if live in SMSA              |
| northcen | =1 if live in north central U.S |
| south    | =1 if live in southern region   |
| west     | =1 if live in western region    |
| construc | =1 if work in construc. indus.  |
| ndurman  | =1 if in nondur. manuf. indus.  |
| trcommpu | =1 if in trans, commun, pub ut  |
| trade    | =1 if in wholesale or retail    |
| services | =1 if in services indus.        |
| profserv | =1 if in prof. serv. indus.     |
| profocc  | =1 if in profess. occupation    |
| clerocc  | =1 if in clerical occupation    |
| servocc  | =1 if in service occupation     |
| lwage    | log(wage)                       |
| expersq  | exper^2                         |
| tenursq  | tenure^2                        |
+----------+---------------------------------+

These are data from the 1976 Current Population Survey, collected by
Henry Farber when he and I were colleagues at MIT in 1988.

時間平均賃金wageを被説明変数に設定する。statsmodelsでは，定数項だけの回帰式を考える場合，1をつか加える必要がある。

form_const = 'wage ~ 1'  # 定数項だけの場合は１が必要

res_const = smf.ols(form_const, data=wage1).fit()

res_const.params

Intercept    5.896103
dtype: float64

この推定値が賃金の平均と等しいことを確認するために，直接平均を計算する。

wage1.loc[:,'wage'].mean()

5.896102674787035

この結果はGM仮定4の\(\text{E}(u|X)=0\)から簡単に導出できる。この仮定を標本回帰式で考えると次式となる。

\[\frac{1}{N}\sum_{i=1}^Nu_i=0\]

この式の左辺に回帰式を代入すると

\[\frac{1}{N}\sum_{i=1}^N\left( y_i-\beta_0\right)=\bar{y}-\beta_0\]

この式が0となる\(\hat{\beta_0}\)がOLS推定量なので

\[\bar{y}=\hat{\beta_0}\]

となる。

ダミー変数あり：２つのケース#

同じデータを使って\(\{0,1\}\)の値を取るダミー変数を考える。データセットwage1の中のfemaleという変数があり，以下の値を取る。

女性の場合：female = 1
男性の場合：female = 0

値が０のカテゴリーを基本カテゴリーという。

\(D\)をfemaleのダミー変数とすると回帰式は以下のようになる。

\[ y=\beta_0+\beta_1D \]

さらに，この式は\(D=\{0,1\}\)の値によって以下のように表すことができる。

男性：\(D=0\quad\Rightarrow\quad y=\beta_0+u\)

女性：\(D=1\quad\Rightarrow\quad y=\beta_0+\beta_1+u\)

即ち，OLS推定量は以下を表すことになる。

\(\hat{\beta}_0\)：男性の平均賃金

\(\hat{\beta}_0+\hat{\beta}_1\)：女性の平均賃金

この回帰式を使い，時間賃金の男女間の差について考察する。

form_const_2 = 'wage ~ female'

res_const_2 = smf.ols(form_const_2, data=wage1).fit()

res_const_2.params

Intercept    7.099489
female      -2.511830
dtype: float64

female=0の場合は男性なので，定数項の値（約7.10）が男性の時間賃金の平均である。
female=1の場合は女性なので，女性の時間賃金の平均は

\[7.10-2.51\approx 4.59\]

即ち，男性（基本カテゴリー）と比べて女性の平均賃金は2.51ドル低い。

データを使い直接男女の平均を計算して確認する。

＜女性の場合＞

# 女性だけを抽出するTrue/False条件の作成
cond_female = (wage1['female']==1)

wage1.loc[cond_female,'wage'].mean()

4.587658740225292

＜男性の場合＞

# 男性だけを抽出するTrue/False条件の作成
cond_male = (wage1['female']==0)

wage1.loc[cond_male,'wage'].mean()

7.099489067157689

（解釈）

女性の時間賃金は約2.51ドル低い
しかし比較する場合，同じ条件で比べるべきではないだろうか。例えば，未婚・既婚，教育年数や就労期間が賃金に影響する場合，この差を考慮すべきである。しかし，この回帰式にはそのような変数は含まれていないため，含まれていない変数の違いが賃金の違いに反映されている可能性が高い。

ダミー変数あり：４つのケース#

データセットwage1にはmarriedという変数が含まれており，以下の値をとる。

既婚者の場合：married = 1
未婚者の場合：married = 0

femaleと組み合わせることにより，次の４つのケースを分けることができる。

未婚男性：female=0, married=0
未婚女性：female=1, married=0
既婚女性：female=1, married=1
既婚男性：female=0, married=1

この４つのケースを捉えるために、femaleとmarriedの値によって0もしくは1の値になるダミー変数を作成するが，２つのステップに分けて説明する。

＜ステップ１＞

新たなダミー変数の作成ルールを定義する関数を作成する。この関数ではDataFrameの行を引数とし，ステップ２ではこの関数を各行に適用しダミー変数を作成することになる。

# 以下では row をDataFrameの行と考える。

# 未婚男性の関数
def singmale(row):
    if row['female'] == 0 and row['married'] == 0:
        return 1
    else:
        return 0

# 既婚男性の関数
def marmale(row):
    if row['female'] == 0 and row['married'] == 1:
        return 1
    else:
        return 0

# 未婚女性の関数
def singfem(row):
    if row['female'] == 1 and row['married'] == 0:
        return 1
    else:
        return 0

# 既婚女性の関数
def marfem(row):
    if row['female'] == 1 and row['married'] == 1:
        return 1
    else:
        return 0

singmaleを考えてみる。引数のrowは行を表すので、例としてwage1の0番目の行を考えてみよう。その行にはfemaleとmarriedの列が含まれており、それらの列の情報に基づいて返り値を設定している。もしfemaleの値が0であり、さらにmarriedの値も0であれば1を返し、そうでなければ0を返す関数となっている。他の関数も同様に考えることができる。

＜ステップ２＞

上の関数を使い，wage1に新たな列を追加する。

以下のコードで使う.apply()は第１引数の関数を行または列に適用するメソッドである。axis=1は「列」を指定する引数であり、列が縦にあるように「上から下に向かって各行に関数を適用する」ことを指定している（axis='columns'でもOK）。また、引数の関数に()は書かないことに注意しよう。即ち，引数に関数名（関数を参照する「参照記号」）を引数とすることにより，関数のオブジェクトを参照していることになる。即ち、.apply()は第１引数の関数を参照し、それを実行するメソッドである。

wage1.loc[:,'singmale'] = wage1.apply(singmale, axis=1)  # axis='columns'でもOK
wage1.loc[:,'marmale'] = wage1.apply(marmale, axis=1)    # axis='columns'でもOK
wage1.loc[:,'singfem'] = wage1.apply(singfem, axis=1)    # axis='columns'でもOK
wage1.loc[:,'marfem'] = wage1.apply(marfem, axis=1)      # axis='columns'でもOK

wage1.head(3)

	wage	educ	exper	tenure	female	married	numdep	smsa	...	servocc	lwage	expersq	tenursq	singmale	singfem	marfem
0	3.10	11	2	0	1	0	2	1	...	0	1.131402	4	0	0	1	0
1	3.24	12	22	2	1	1	3	1	...	1	1.175573	484	4	0	0	1
2	3.00	11	2	0	0	0	2	0	...	0	1.098612	4	0	1	0	0

3 rows × 28 columns

表示されたDataFrameを右にスクロールするとsingmale、marmale、singfem、marfemの4つの列が追加されていることが確認できる。

一方でコードを書く際、似たものをコピペして少しだけ変更することを繰り返すとエラーや意図しない結果につながる可能性が高くなる。その場合は、次のようにforループで書くことを薦める。

まず辞書を作成する：

キーを列のラベル
値を関数名

func_dict = {'singmale':singmale,
             'marmale':marmale,
             'singfem':singfem,
             'marfem':marfem}

次にfunc_dictを使いforループで列を追加する。ここでkeyは辞書のキーを指す。

for key in func_dict:
    wage1.loc[:,key] = wage1.apply(func_dict[key], axis=1)

wage1.head(3)

	wage	educ	exper	tenure	female	married	numdep	smsa	...	servocc	lwage	expersq	tenursq	singmale	singfem	marfem
0	3.10	11	2	0	1	0	2	1	...	0	1.131402	4	0	0	1	0
1	3.24	12	22	2	1	1	3	1	...	1	1.175573	484	4	0	0	1
2	3.00	11	2	0	0	0	2	0	...	0	1.098612	4	0	1	0	0

3 rows × 28 columns

これら４つのケースを捉えるために次の回帰式を使う。

\(y=\beta_0+\beta_1D_1+\beta_2D_2+\beta_3D_3+u\)

基本カテゴリー：singmale
\(D_1\)=marmale
\(D_2\)=singfem
\(D_3\)=marfem

\(D_1=\{0,1\}\)、\(D_2=\{0,1\}\)、\(D_3=\{0,1\}\)の取る値を考慮すると，以下の４つのパターンに分けることができる。

\(D_1=0\) & \(D_2=0\) & \(D_3=0\quad\Rightarrow\quad\) \(y=\beta_0+u\)

\(D_1=1\) & \(D_2=0\) & \(D_3=0\quad\Rightarrow\quad\) \(y=\beta_0+\beta_1+u\)

\(D_1=0\) & \(D_2=1\) & \(D_3=0\quad\Rightarrow\quad\) \(y=\beta_0+\beta_2+u\)

\(D_1=0\) & \(D_2=0\) & \(D_3=1\quad\Rightarrow\quad\) \(y=\beta_0+\beta_3+u\)

即ち，OLS推定量は以下を表すことになる。

\(\hat{\beta}_0\)：未婚男性の平均賃金

\(\hat{\beta}_0+\hat{\beta}_1\)：既婚男性の平均賃金

\(\hat{\beta}_0+\hat{\beta}_2\)：未婚女性の平均賃金

\(\hat{\beta}_0+\hat{\beta}_3\)：既婚女性の平均賃金

form_const_4 = 'wage ~ marmale + singfem + marfem'

res_const_4 = smf.ols(form_const_4, data=wage1).fit()

para4 = res_const_4.params
para4

Intercept    5.168023
marmale      2.815009
singfem     -0.556440
marfem      -0.602114
dtype: float64

（結果）

未婚男性の平均賃金は約5.16
未婚男性に比べて既婚男性の平均賃金は約2.82高い
未婚男性に比べて未婚女性の平均賃金は約0.56低い
未婚男性に比べて既婚女性の平均賃金は約0.60低い

wage1のデータから直接計算して確認する。

未婚男性の平均賃金

wage1.query('female==0 & married==0')['wage'].mean()

5.168023282705351

para4.iloc[0]

5.1680232827053505

既婚男性の平均賃金

wage1.query('female==0 & married==1')['wage'].mean()

7.983031926002909

para4.iloc[0]+para4.iloc[1]

7.983031926002907

未婚女性の平均賃金

wage1.query('female==1 & married==0')['wage'].mean()

4.611583345135053

para4.iloc[0]+para4.iloc[2]

4.611583345135053

既婚女性の平均賃金

wage1.query('female==1 & married==1')['wage'].mean()

4.565909099398238

para4.iloc[0]+para4.iloc[3]

4.565909099398237

ケース２：定量的変数の導入#

１つのダミー変数femaleだけが入るケースに次の変数を加えた回帰式を考える。

educ：教育年数
exper：雇用経験年数
tenure：勤続年数

form_1 = 'wage ~ female + educ + exper+ tenure'

res_1 = smf.ols(form_1, data=wage1).fit()

res_1.params

Intercept   -1.567939
female      -1.810852
educ         0.571505
exper        0.025396
tenure       0.141005
dtype: float64

（解釈）

賃金格差は約-1.81に減少した。これはeduc, exper, tenureの影響を取り除いた結果である。言い換えると，教育，経験，就労期間を所与とすると（それらの変数が同じである場合という意味），女性の時間賃金は約1.8ドル低い。
女性差別を捉えているのだろうか？回帰式にない変数（その影響は誤差項に入っている）が残っている可能性があるので，必ずしもそうではない。またその場合，欠落変数バイアスも発生し得る。

ケース３：ダミー変数の交差項#

ケース１と２の被説明変数はwageをそのまま使ったが，ここでは対数を取り賃金方程式にダミー変数の交差項を入れて考察する。

以下の回帰式を考える。

\[y=\beta_0+\beta_1D+ \beta_2Dx+\beta_3x + u\]

ここで\(D\)がダミー変数，\(x=\)は定量的変数であり，\(Dx\)がダミー変数の交差項である。ダミー変数が取る値\(D=\{0,1\}\)に分けて考えると，以下を推定することになる。

\(D=0\quad\Rightarrow\quad y=\beta_0+\beta_3x + u\)

\(D=1\quad\Rightarrow\quad y=\left(\beta_0+\beta_1\right)+ \left(\beta_2+\beta_3\right)x + u\)

具体例として\(D=\) female，\(x=\)educとするとOLS推定量は以下を表すことになる。

\(\hat{\beta}_0\)：（教育の効果を取り除いた）男性の平均賃金（対数）

\(\hat{\beta}_3\)：男性の賃金に対する教育の効果（％）

\(\hat{\beta}_0+\hat{\beta}_1\)：（教育の効果を取り除いた）女性の平均賃金（対数）

\(\hat{\beta}_2+\hat{\beta}_3\)：女性の賃金に対する教育の効果（％）

（注意）

statsmodelsの回帰式では\(\text{female}\times \text{educ}\)を\(\text{female}:\text{educ}\)と書く。

form_2 = 'np.log(wage) ~ female + female:educ + educ + exper + tenure'

res_2 = smf.ols(form_2, data=wage1).fit()

print(res_2.summary().tables[1])

===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.4647      0.123      3.781      0.000       0.223       0.706
female         -0.2104      0.174     -1.209      0.227      -0.552       0.131
female:educ    -0.0072      0.014     -0.534      0.593      -0.034       0.019
educ            0.0903      0.009     10.359      0.000       0.073       0.107
exper           0.0046      0.002      2.850      0.005       0.001       0.008
tenure          0.0174      0.003      5.849      0.000       0.012       0.023
===============================================================================

\(t\)検定

female
- 教育などの影響を省いた後の平均賃金の差
- 5%有意水準で\(\text{H}_0\)female=0は棄却できない。
female:educ
- 教育などの影響を省いた後の教育の収益率の差
- 5%有意水準で\(\text{H}_0\) female:educ=0は棄却できない。

\(F\)検定

\(\text{H}_0\): female\(=\) female:educ\(=0\)の制約を考えよう。

hypotheses = 'female=0, female:educ=0'

res_2.f_test(hypotheses).pvalue

3.8945546915249165e-14

\(\text{H}_0\)は棄却される。

\(t\)検定では，femaleとfemale:educはそれぞれの帰無仮説が棄却されなかったが，\(F\)検定では制約が棄却された。一貫性がなく説明変数が不足している可能性がある。

カテゴリー変数#

カテゴリー変数とは定性的な変数であり，男女もカテゴリー変数の一種である。カテゴリー変数をダミー変数に変換するには2つの方法がある。

statsmodelsにはカテゴリー変数に自動的にダミー変数を割り当てる機能がある。操作は簡単で，単に回帰式の中でC()の中にカテゴリー変数を入れるだけである。
C()を使わない方法。

`C()`を使う方法#

式の中にダミー変数を入れるためには，変数がpandasのカテゴリー型変数に変換される必要がある。回帰式の中でC()を使うと，文字型データなどは自動的に変換される。

例１#

例として，男女のカテゴリーがあるwage1のデータセットをもう一度使う。

df = wage1.loc[:,['wage', 'female', 'educ']]

dfのメソッドreplace()を使ってfemaleの列の値を以下のルールに沿って変換し，それをdfにsexという列として入れ直す。

1 \(\Rightarrow\) female
0 \(\Rightarrow\) male

replace()の中は辞書になっている（replace()の代わりにmap()でもOK）。

df.loc[:,'sex'] = df['female'].replace({1:'female',0:'male'})

df.head(3)

	wage	female	educ	sex
0	3.10	1	11	female
1	3.24	1	12	female
2	3.00	0	11	male

sexの変数をC()に入れて回帰式を書いて計算。

form_c = 'wage ~  C(sex) + educ'

res_c = smf.ols(form_c, data=df).fit()

res_c.params

Intercept        -1.650545
C(sex)[T.male]    2.273362
educ              0.506452
dtype: float64

C(sex)[T.male]について

TはTreatmentの頭文字で，通常のダミー変数を設定することを示している。
maleはmaleの変数であることを表しており，自動的にfemaleが基本カテゴリーに設定されたことが分かる。

（結果）

C(sex)[T.male]はfemaleに比べてmaleの賃金は約2.27ドル高いことを示している。

一方で，基本カテゴリーを手動で設定したい場合がある。その場合にはC()にTreatment("＜設定したいカテゴリー＞")を引数として追加する。

（注意）Treatment()の中は double quotations " "を使用すること。

例えば，maleを基本カテゴリーとする。

form_cm = 'wage ~  C(sex,Treatment("male")) + educ'

res_cm = smf.ols(form_cm, data=df).fit()

res_cm.params

Intercept                              0.622817
C(sex, Treatment("male"))[T.female]   -2.273362
educ                                   0.506452
dtype: float64

この結果は，male基本カテゴリーとする以下の結果と同じである。

form_ca = 'wage ~  female + educ'

res_ca = smf.ols(form_ca, data=df).fit()

res_ca.params

Intercept    0.622817
female      -2.273362
educ         0.506452
dtype: float64

例２#

米国ニュージャージーとペンシルベニアのファースト・フード店で人種と所得によって価格差別をおこなっているかを検証する例を取り上げる。wooldridgeパッケージのデータセットdiscrimには以下の変数が含まれている。

フライド・ポテトの平均価格（pfries）
人口における黒人の比率（prpblck）
平均所得（income）
４つのファースト・フード店（chain; カテゴリー変数で1~4の数字が使われている）
- 1: Berger King
- 2: KFC
- 3: Roy Rogers
- 4: Wendy’s

discrim = wooldridge.data('discrim')
wooldridge.data('discrim',description=True)

name of dataset: discrim
no of variables: 37
no of observations: 410

+----------+----------------------------------------------+
| variable | label                                        |
+----------+----------------------------------------------+
| psoda    | price of medium soda, 1st wave               |
| pfries   | price of small fries, 1st wave               |
| pentree  | price entree (burger or chicken), 1st wave   |
| wagest   | starting wage, 1st wave                      |
| nmgrs    | number of managers, 1st wave                 |
| nregs    | number of registers, 1st wave                |
| hrsopen  | hours open, 1st wave                         |
| emp      | number of employees, 1st wave                |
| psoda2   | price of medium soday, 2nd wave              |
| pfries2  | price of small fries, 2nd wave               |
| pentree2 | price entree, 2nd wave                       |
| wagest2  | starting wage, 2nd wave                      |
| nmgrs2   | number of managers, 2nd wave                 |
| nregs2   | number of registers, 2nd wave                |
| hrsopen2 | hours open, 2nd wave                         |
| emp2     | number of employees, 2nd wave                |
| compown  | =1 if company owned                          |
| chain    | BK = 1, KFC = 2, Roy Rogers = 3, Wendy's = 4 |
| density  | population density, town                     |
| crmrte   | crime rate, town                             |
| state    | NJ = 1, PA = 2                               |
| prpblck  | proportion black, zipcode                    |
| prppov   | proportion in poverty, zipcode               |
| prpncar  | proportion no car, zipcode                   |
| hseval   | median housing value, zipcode                |
| nstores  | number of stores, zipcode                    |
| income   | median family income, zipcode                |
| county   | county label                                 |
| lpsoda   | log(psoda)                                   |
| lpfries  | log(pfries)                                  |
| lhseval  | log(hseval)                                  |
| lincome  | log(income)                                  |
| ldensity | log(density)                                 |
| NJ       | =1 for New Jersey                            |
| BK       | =1 if Burger King                            |
| KFC      | =1 if Kentucky Fried Chicken                 |
| RR       | =1 if Roy Rogers                             |
+----------+----------------------------------------------+

K. Graddy (1997), “Do Fast-Food Chains Price Discriminate on the Race
and Income Characteristics of an Area?” Journal of Business and
Economic Statistics 15, 391-401. Professor Graddy kindly provided the
data set.

それぞれのファースト・フード店の数を確認する。

discrim['chain'].value_counts()

chain
1    171
3     99
2     80
4     60
Name: count, dtype: int64

OLS推定をおこなう。

form_p = 'np.log(pfries) ~ prpblck + np.log(income) + C(chain)'

res_p = smf.ols(form_p, data=discrim).fit()

print(res_p.summary().tables[1])

==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         -1.0812      0.222     -4.860      0.000      -1.519      -0.644
C(chain)[T.2]     -0.0682      0.014     -4.943      0.000      -0.095      -0.041
C(chain)[T.3]      0.0811      0.013      6.215      0.000       0.055       0.107
C(chain)[T.4]     -0.0434      0.015     -2.892      0.004      -0.073      -0.014
prpblck            0.1317      0.031      4.185      0.000       0.070       0.194
np.log(income)     0.0914      0.021      4.441      0.000       0.051       0.132
==================================================================================

自動的にBerger Kingが基本カテゴリーに設定されている。

（結果）

BKとKFCの価格比較

\[\begin{split} \begin{align} \ln P_{\text{KFC}}&=\ln P_{\text{BK}}-0.0682 \\ &\Downarrow \\ \dfrac{P_{\text{KFC}}}{P_{\text{BK}}}-1&=e^{-0.0682}-1\approx-0.06593 \end{align} \end{split}\]

Berger Kingと比べてKFCの価格は約6.6%低いことがわかる。
prpblckとnp.log(income)の\(p\)値は0に近く，帰無仮説は１％有意水準でも棄却できる。

`C()`を使わない方法#

例２を使って説明する。まず説明の前準備としてdiscrimの中で使う変数だけを取り出そう。

df_c = discrim.loc[:,['pfries', 'prpblck', 'income', 'chain']]
df_c.head()

	pfries	prpblck	income	chain
0	1.06	0.171154	44534.0	3
1	0.91	0.171154	44534.0	1
2	0.91	0.047360	41164.0	1
3	1.02	0.052839	50366.0	3
4	NaN	0.034480	72287.0	1

replace()を使ってchainの列の値を以下のルールに沿って変換する。

1 \(\Rightarrow\) Berger King
2 \(\Rightarrow\) KFC
3 \(\Rightarrow\) Roy Rogers
4 \(\Rightarrow\) Wendy’s

df_c.loc[:,'chain'] = df_c['chain'].replace(
                        {1:'Berger_King',2:'KFC',3:'Roy_Rogers',4:'Wendys'})

df_c.head()

	pfries	prpblck	income	chain
0	1.06	0.171154	44534.0	Roy_Rogers
1	0.91	0.171154	44534.0	Berger_King
2	0.91	0.047360	41164.0	Berger_King
3	1.02	0.052839	50366.0	Roy_Rogers
4	NaN	0.034480	72287.0	Berger_King

DataFrameの特徴を確認する。

df_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 410 entries, 0 to 409
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   pfries   393 non-null    float64
 1   prpblck  409 non-null    float64
 2   income   409 non-null    float64
 3   chain    410 non-null    object 
dtypes: float64(3), object(1)
memory usage: 12.9+ KB

列chainはobjectとなっており文字型であることがわかる。

＜C()を使わない方法＞

pandasの関数であるpd.Categorical()を使って列chainをpandasのカテゴリー型に変換する。

df_c['chain'] = pd.Categorical(df_c['chain'])
df_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 410 entries, 0 to 409
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   pfries   393 non-null    float64 
 1   prpblck  409 non-null    float64 
 2   income   409 non-null    float64 
 3   chain    410 non-null    category
dtypes: category(1), float64(3)
memory usage: 10.3 KB

列chainはcategoryになっている。後は，chainをそのまま回帰式に使うだけである。

form_c = 'np.log(pfries) ~ prpblck + np.log(income) + chain'

res_c = smf.ols(form_c, data=df_c).fit()

print(res_c.summary().tables[1])

=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              -1.0812      0.222     -4.860      0.000      -1.519      -0.644
chain[T.KFC]           -0.0682      0.014     -4.943      0.000      -0.095      -0.041
chain[T.Roy_Rogers]     0.0811      0.013      6.215      0.000       0.055       0.107
chain[T.Wendys]        -0.0434      0.015     -2.892      0.004      -0.073      -0.014
prpblck                 0.1317      0.031      4.185      0.000       0.070       0.194
np.log(income)          0.0914      0.021      4.441      0.000       0.051       0.132
=======================================================================================

C()を使って推定した結果と同じである。

質的変数と回帰分析

Contents

質的変数と回帰分析#

ダミー変数#

ケース１：定数項だけの回帰分析#

ダミー変数なし#

ダミー変数あり：２つのケース#

ダミー変数あり：４つのケース#

ケース２：定量的変数の導入#

ケース３：ダミー変数の交差項#

カテゴリー変数#

C()を使う方法#

例１#

例２#

C()を使わない方法#

`C()`を使う方法#

`C()`を使わない方法#