Jiangtang's profile技止于此BlogListsNetwork Tools Help

Blog


    5/4/2007

    SAS Logistic回归:一个完整例子

                                                    SAS Logistic回归:代码及输出报告详解

    这篇将作为五一后一个讲稿的阅读材料之一,先整出来就搁这。如果没有耐心读下去,你可以立即转到以下的参考资料,该篇所有的知识都来自它们:

    1. Cody, R.F. and Smith, J.K. Applied Statistics and the SAS Programming Language,4th ed..NJ:  Prentice-Hall,1997.这书已经出第五版了,北大图书馆只有这第四版。非常容易上手的一本书,前半部分用input和datalines让读者专心做统计,后半部分从导入导出数据开始阐述SAS的通用编程语言。这本书用的是SAS8.这里我们只关注它第九章Multiple-Regression Analysis的最后Logistic Regression部分。我这篇的例子即来于此,有简化;
    2. SAS OnlineDoc V8, 或者SAS OnlineDoc V9, 是要花功夫熟悉它们的结构了。以前我四处下载了数G的电子书,现在才发觉还是它们好使。体例上V8和V9一样,你找到SAS/STAT-->SAS/STAT User's Guide-->The LOGISTIC Procedure, 就可以跟着学习了,文字都非常简明。

    Logistic回归处理因变量是分类型变量如“0、1”的情形。一下就假设你至少对它模模糊糊有些印象,比如说我们用p表示正例(如输出变量为“1”)的概率,那么p/(1-p)就被称作odds ratio,对p做logit变换记做logit(p),它等于log(p/(1-p),我们回归方程的形式就如logit(p)=log(p/(1-p)=a+bx,你可以把它理解成向量形式。

    假设我们有一个数据,45个观测值,四个变量,包括:

    1. age(年龄,数值型);
    2. vision(视力状况,分类型,1表示好,0表示有问题);
    3. drive(驾车教育,分类型,1表示参加过驾车教育,0表示没有) 和
    4. 一个分类型输出变量accident(去年是否出过事故,1表示出过事故,0表示没有)。我们的目的就是要考察前三个变量与发生事故的关系。

    为了保持程序的可读性,以下我就直接在代码中一行一行敲入数据,在实际工作中,超过20个观测值的数据就要写程序导入了。另外,/*   */之间的语句都是注释,SAS运行时会把它们忽略掉:

    ----------------------------------------------------------------------------------------------------------------------------

    /*

    前三行,首先data步,建立一个叫logistic的临时数据集,它在work临时文件夹里。一般推荐用记事本把程序保存起来;input语句表示一下是几个变量;从datalines开始就一行一行输入数据,每个单独的数据之间用空格隔开,每个观测值各占一行。需要注意的是,datalines和下面的数据之间不要有空行,否则SAS会认为出现了一个缺失值。数据输入完毕,以分号结束,这个分号一定要另起一行。

    */

    data logistic;
        input accident age vision drive;
    datalines;
    1 17 1 1
    1 44 0 0
    1 48 1 0
    1 55 0 0
    1 75 1 1
    0 35 0 1
    0 42 1 1
    0 57 0 0
    0 28 0 1
    0 20 0 1
    0 38 1 0
    0 45 0 1
    0 47 1 1
    0 52 0 0
    0 55 0 1
    1 68 1 0
    1 18 1 0
    1 68 0 0
    1 48 1 1
    1 17 0 0
    1 70 1 1
    1 72 1 0
    1 35 0 1
    1 19 1 0
    1 62 1 0
    0 39 1 1
    0 40 1 1
    0 55 0 0
    0 68 0 1
    0 25 1 0
    0 17 0 0
    0 45 0 1
    0 44 0 1
    0 67 0 0
    0 55 0 1
    1 61 1 0
    1 19 1 0
    1 69 0 0
    1 23 1 1
    1 19 0 0
    1 72 1 1
    1 74 1 0
    1 31 0 1
    1 16 1 0
    1 61 1 0
    ;

    /*

    以下两行我们不妨称作过程步甲。过程步以proc开头,加上要实现功能的名字,这里是logistics,接下来是要引用的数据。值得注意的是那个desending选项。SAS的Logistic回归方程log(odds)默认的形式是处理那个变量值比较小的,这里是accident=0,但我们要考察的是发生事故accident=1的情况,加上desending降序排列,它就处理accident=1的log(odds)了。再model引导的就是回归方程的形式,写成“因变量=自变量1    自变量2   自变量3”的样子。最后以run结束语句,与proc对应。

    */

    proc logistic data=logistic descending;
         model accident=age vision drive; run;

    /*

    运行以上程序,就要跑出一大堆结果了。但在处理多元回归时,语句很难得会只像过程步甲一样简洁。以下过程步乙只加入一个变量选择选项forward。SAS在处理自变量选择上采用了5个技术,这里只简单提一下3个常用的技术。

    1.forward——前向选择,变量一个个进入回归方程,按照一些卡方标准,最显著的也就是最好的变量最先进入,然后就是次最好的,以此类推。某个变量一旦进入,就不再退出;

    2.backward——后向剔除,一开始全部变量都进入回归方程,又按照一些标准,把最不好的变量一一剔除;

    3.stepwise——逐步回归,这个跟forward有些类似,不同的是,stepwise在变量进入以后还有一个backward后向剔除的过程,而在forward里,变量一旦进入,就不再退出。

    */ 

    proc logistic data=logistic descending;
       model accident=age vision drive  /
                    selection=forward;run;

    ----------------------------------------------------------------------------------------------------------------------------

    用上面这个过程步乙替代过程步甲,再运行一遍。这两个过程步的输出结果大同小异,只是过程步乙多了个forward选项。以下用的是过程步乙的输出结果,其中黑体字是输出结果本身,我做的注释语句以红笔描出,包括数字编号。 

                     (1)  The SAS System 10:47 Tuesday, May 4, 2007 1

                                 The LOGISTIC Procedure

                                   Model Information

    Data Set                                         WORK.LOGISTIC
    Response Variable                          accident
    Number of Response Levels            2
    Number of Observations                 45
    Model                                              binary logit
    Optimization Technique                 Fisher's scoring

     

                                 Response Profile                                     
    Ordered  Value     accident          Total Frequency

           1                        1                             25
           2                        0                             20

    Probability modeled is accident=1.

    (1) 给出了本模型的基本信息,意思大多自明。需要注意的是Response Profile 中,accident=1排在首位。前面我们说过,SAS的Logistic回归方程log(odds)默认的形式是处理那个变量值比较小的,加上descending选项后,accident=1就排在首位了。

                          (2)   Forward Selection Procedure

    Step 0. Intercept entered:

                              Model Convergence Status

    Convergence criterion (GCONV=1E-8) satisfied.

                              Residual Chi-Square Test

    Chi-Square              DF                Pr > ChiSq

      10.7057                   3                     0.0134

     

    Step 1. Effect vision entered:

                                  Model Convergence Status

    Convergence criterion (GCONV=1E-8) satisfied. 
                                   Model Fit Statistics                                               
    Criterion     Intercept Only     Intercept and Covariates
    AIC                 63.827                   59.244
    SC                   65.633                   62.857
    -2 Log L           61.827                  55.244

                           Testing Global Null Hypothesis: BETA=0
    Test                       Chi-Square    DF    Pr > ChiSq
    Likelihood Ratio      6.5830         1         0.0103
    Score                       6.4209         1         0.0113
    Wald                        6.0756         1         0.0137

                           Residual Chi-Square Test
    Chi-Square            DF      Pr > ChiSq
          4.9818              2         0.0828

    Step 2. Effect drive entered:
                         Model Convergence Status
                      Convergence criterion (GCONV=1E-8) satisfied.                      

                             Model Fit Statistics
    Criterion      Intercept Only     Intercept and Covariates
    AIC                  63.827                     56.287
    SC                   65.633                     61.707
    -2 Log L           61.827                     50.287

     

                     Testing Global Null Hypothesis: BETA=0
    Test                        Chi-Square     DF   Pr > ChiSq
    Likelihood Ratio         11.5391        2      0.0031
    Score                          10.5976        2      0.0050
    Wald                            8.5949         2      0.0136
                 

                     Residual Chi-Square Test
    Chi-Square     DF   Pr > ChiSq
         0.1293        1     0.7191

    NOTE: No (additional) effects met the 0.05 significance level for entry into the model.

    (2) 给出了自变量进入模型的次序。先是截距项Step 0了,不管它。Step 1 vision第一个进入模型,附带了很多评估它对因变量预测能力的指标。-2 Log L 和Score 用来检测自变量是否显著。-2 Log L中的L就是Likelihood Ratio,它的p值是0.0103Score  的p值是0.0113,都小于0.05,故vision是一个很显著的解释变量。还有,AIC(Akaike Information Criterion)和SC(Schwarz Criterion)两个信息量标准用来比较不同的模型,它们数值越小,模型变现就越好,这个接下来我们看step2 drive变量进入模型后的情况。 我们可以看到模型的表现变好了,因为这是AICSC的值变小了,-2 Log L 和Score对应的p值也更小。

                    (3)   Summary of Forward Selection
    Step    Effect Entered        DF      Number In      Score Chi-Square        Pr > ChiSq
    1            vision                    1           1                      6.4209                        0.0113
    2            drive                     1            2                     4.8680                         0.0274

     (3) 总结了我们模型使用的前向选择方法,包括自变量进入模型的次序,以及每个自变量的卡方值和p值。

                        (4) Analysis of Maximum Likelihood Estimates
    Parameter   DF   Estimate    Standard Error  Wald Chi-Square   Pr > ChiSq
    Intercept     1      0.1110               0.5457           0.0414                 0.8389
    vision           1      1.7137               0.7049           5.9113                 0.0150
    drive            1     -1.5000               0.7037           4.5440                 0.0330

    (4) 给出了模型参数的估计,据此可以写出改回归方程的形式是log(odds of having an accident)=log(p/(1-p))=0.1110+1.7137*vision-1.5000*drive。我们知道,odds=p/(1-p),有p=odds/(1+odds)。假设有个哥们,视力没问题但没有受过驾车教育(vision=0,drive=0),代入方程,有log(odds)=0.1110,再odds=exp(0.110)=1.1174,p=1.1174/2.1174=0.5277,即我们说这人发生事故的概率为0.5277;又另一个,视力有问题同样没受过驾车教育(vision=1,drive=0),同样的步骤得log(odds)=1.8249,odds=exp(1.8249)=6.2022,p=6.2022/7.2022=0.8612,即这人发生事故的概率为0.8612(视力多重要)。

                     (5) Odds Ratio Estimates
    Effect       Point Estimate               95% Wald Confidence Limits
    vision            5.550                               1.394           22.093
    drive             0.223                               0.056            0.886

     (5) 是对比率Odds Ratio的估计。这里对vision的odds ratio的点估是 5.550 ,这个比率由odds(having an accident|vision=1,drive=0)比odds(having an accident|vision=0,drive=0)得来,根据(4) 的计算,就是6.2022/1.1174=5.550.还有,对vision来说,95% 的置信区间不包括1,说明vision是一个非常显著的解释变量(比率的置信区间不包括1,就跟p值小于0.05一样是一个规则)。

                     (6) Association of Predicted Probabilities and Observed Responses
    Percent Concordant             67.2               Somers' D 0.532
    Percent Discordant              14.0               Gamma 0.655
    Percent Tied                        18.8                Tau-a 0.269
    Pairs                                     500                  c 0.766

    (6) 这个东西就有些复杂,大概说些预测概率与观测道德因变量间的关联性,我们看到一致性比率Percent Concordant 为67.2 %,不一致性比率Percent Discordant 为14.0%,说明预测值与观测值在现有水平上有较强的关联性,回归模型有很强的预测能力。

    最后注意到,上面我们用过程步乙得出的输出结果没有age这个自变量,用过程步甲得出的输出有,这不是因为年龄在这个预测模型中不重要,而是上面以数值型面目出现的年龄在是否出现事故的两组人中分布不均匀,为了解决这个问题,把年龄分组就是,不多说了。

    Comments (5)

    Please wait...
    Sorry, the comment you entered is too long. Please shorten it.
    You didn't enter anything. Please try again.
    Sorry, we can't add your comment right now. Please try again later.
    To add a comment, you need permission from your parent. Ask for permission
    Your parent has turned off comments.
    Sorry, we can't delete your comment right now. Please try again later.
    You've exceeded the maximum number of comments that can be left in one day. Please try again in 24 hours.
    Your account has had the ability to leave comments disabled because our systems indicate that you may be spamming other users. If you believe that your account has been disabled in error please contact Windows Live support.
    Complete the security check below to finish leaving your comment.
    The characters you type in the security check must match the characters in the picture or audio.

    To add a comment, sign in with your Windows Live ID (if you use Hotmail, Messenger, or Xbox LIVE, you have a Windows Live ID). Sign in


    Don't have a Windows Live ID? Sign up

    鹏 张wrote:
    你好,很谢谢你的这篇文章。
    解释的非常清楚和生动。
    不过我还有两个疑问,
    1,在什么情况下用logistic回归。这与平常用的logistic成长模型有什么联系?
    2,最后为什么没有年龄这个自变量还不清楚。能把分组的解决过程和用过程甲输出的结果写一下么?
    June 30
    Jiangtang Huwrote:
    to lisa:

    有时候它们混用,因为都是比率。但也可以区别开来,正如你指出的那样,odds ratio就是比率的比率了。比如,下面一个来自ucla(http://www.ats.ucla.edu/stat/,这个网站就可以作为学习的起步)的例子就很好:

    SAS FAQ: How do I interpret odds ratios in logistic regression?

    Let's begin with probability. Let's say that the probability of success is .8, thus

          p = .8

    Then the probability of failure is

          q = 1 - p = .2

    The odds of success are defined as

          odds(success) = p/q = .8/.2 = 4,

    that is, the odds of success are 4 to 1. The odds of failure would be

          odds(failure) = q/p = .2/.8 = .25,

    that is, the odds of failure are 1 to 4. Next, let's compute the odds ratio by

          OR = odds(success)/odds(failure) = 4/.25 = 16

    The interpretation of this odds ratio would be that the odds of success are 16 times greater than for failure. Now if we had formed the odds ratio the other way around with odds of failure in the numerator, we would have gotten something like this,

          OR = odds(failure)/odds(success) = .25/4 = .0625

    Interestingly enough, the interpretation of this odds ratio is nearly the same as the one above. Here the interpretation is that the odds of failure are one-sixteenth the odds of success. In fact, if you take the reciprocal of the first odds ratio you get

          1/16 = .0625

    更多,见http://www.ats.ucla.edu/stat/SAS/faq/oratio.htm
    Dec. 14
    璇 刘wrote:
    从这段看,好像odds和odds ratio不是一个东西:
    "  (5) 是对比率Odds Ratio的估计。这里对vision的odds ratio的点估是 5.550 ,这个比率由odds(having an accident|vision=1,drive=0)比odds(having an accident|vision=0,drive=0)得来,根据(4) 的计算,就是6.2022/1.1174=5.550.还有,对vision来说,95% 的置信区间不包括1,说明vision是一个非常显著的解释变量(比率的置信区间不包括1,就跟p值小于0.05一样是一个规则)。 " 
     
    请问我如果没有这些统计的背景知识,现在又比较需要这种关于regression方面知识,有没有比较好的书入门...
    Sept. 12
    Jiangtang Huwrote:
    1.不好意思,“一下就假设你至少对它模模糊糊有些印象”敲错字了,应该是“以下……”。因为讲得简略,就这么说了。
    2.p/(1-p)称odds,又称为odds ratio。好像是这样的。
    Sept. 12
    璇 刘wrote:
    hi,最近我在学习有关logit modeling的东西,觉得你的学习笔记很有意义,不过有个问题,我读到第二段:
    "Logistic回归处理因变量是分类型变量如“0、1”的情形。一下就假设你至少对它模模糊糊有些印象,比如说我们用p表示正例(如输出变量为“1”)的概率,那么p/(1-p)就被称作odds ratio"
    p/(1-p)是odds?还是 odds ratio?
    然后我继续阅读中.....
     
    Sept. 11

    Trackbacks (2)

    The trackback URL for this entry is:
    http://johnthu.spaces.live.com/blog/cns!2053CD511E6D5B1E!149.trak
    Weblogs that reference this entry