    11/21/2007

    Common Data Mining Mistakes: Identifying and Correcting Them (Translation)

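    This is a piece I translated together with several friends from ttnn. The original is a conference paper by Doug Wielenga of SAS, presented at this year's SAS Global Forum: Identifying and Overcoming Common Data Mining Mistakes. The translation was made with the permission of SAS and of the author, on the condition that it not be used for commercial purposes.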

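    You may distribute this translation freely, provided it is not used for commercial purposes. When citing it, please credit it as "《ttnn BI 观点》集体翻译" (collective translation by ttnn BI 观点). If you have any questions, you can contact the translators; their contact information is at the end of the document.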

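    For an account of how the translation was done, see the post "Translation of 'Common Data Mining Mistakes: Identifying and Correcting Them' Completed!".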

    The translation can be downloaded from http://groups.google.com/group/ttnn/web/kuihuabaodian.pdf

    The original paper is at http://www.iapa.org.au/Environments/edoras/Resources/IAPA/SAS%20Global%20Forum%200732007.pdf

    For reference, the paper is structured as follows:

    Abstract
    Introduction

    1            Preparing The Data

    1.1           Failing To Consider Enough Variables

    1.2           Incorrectly Preparing Or Failing To Prepare Categorical Predictors

    1.2.1      Too Many Overall Levels

    1.2.2      Levels That Rarely Occur

    1.2.3      One Level That Almost Always Occurs

    1.3           Incorrectly Preparing Or Failing To Prepare Continuous Predictors

    1.3.1      Extremely Skewed Predictors

    1.3.2      A Spike And A Distribution

    1.3.3      One Level That Almost Always Occurs

    1.3.4      Ignoring Or Misusing Time-Dependent Information

    2            Defining Roles, Performing Sampling, And Defining Target Profiles

    2.1           Inappropriate Metadata

    2.2           Inadequate Or Excessive Input Data

    2.3           Inappropriate Or Missing Target Profile For Categorical Target

    2.4           Target Variable Event Levels Occurring In Different Proportions

    2.5           Differences In Misclassification Costs

    3            Partitioning The Data

    3.1           Misunderstanding The Roles Of The Partitioned Data Sets

    3.2           Failing To Consider Changing The Default Partition

    4            Choosing The Variables

    4.1           Failing To Evaluate The Variables Before Selection

    4.2           Using Only One Selection Method

    4.3           Misunderstanding Or Ignoring Variable Selection Options

    4.3.1      Choosing Settings In The χ² Mode

    4.3.2      Choosing Settings In The R² Mode

    5            Replacing Missing Data

    5.1           Failing To Evaluate Imputation Method

    5.2           Overlooking Missing Value Indicators

    6            Fitting Linear Regression Models

    6.1           Overusing Stepwise Regression

    6.2           Inaccurately Interpreting The Results

    7            Fitting Decision Tree Models

    7.1           Ignoring Tree Instability

    7.2           Ignoring Tree Limitations

    8            Fitting Neural Network Models

    8.1           Failing To Do Variable Selection

    8.2           Failing To Consider Neural Networks

    9            Comparing Fitted Models

    9.1           Misinterpreting Lift

    9.2           Choosing The Wrong Assessment Statistic

    10        Scoring New Data

    10.1        Generating Inefficient Score Code

    10.2        Ignoring The Model Performance

    11        Clustering Your Data

    11.1        Building One Cluster Solution

    11.2        Including (Many) Categorical Variables

    12        Performing Association And Sequence Analysis

    12.1        Failing To Sort The Data Set

    12.2        Failing To Manage The Number Of Outcomes

    Conclusion

    References

    Acknowledgments

    Contact Information
