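11/21/2007
Common Data Mining Mistakes: Identification and Correction (Translation Draft)

This is a piece I translated together with several friends from ttnn. The original is a conference paper presented at this year's SAS Global Forum by Doug Wielenga of SAS, "Identifying and Overcoming Common Data Mining Mistakes." The translation was made with the permission of SAS and of the author, on the understanding that it will not be used for commercial purposes. You may distribute this translation freely, provided it is not used commercially; when citing it, please credit "《ttnn BI 观点》集体翻译" (collective translation by ttnn BI Viewpoints). If you have any questions, please contact the translators; their contact information is at the end of the document.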
For reference, the paper is structured as follows:

Abstract
1 Preparing The Data
  1.1 Failing To Consider Enough Variables
  1.2 Incorrectly Preparing Or Failing To Prepare Categorical Predictors
    1.2.1 Too Many Overall Levels
    1.2.2 Levels That Rarely Occur
    1.2.3 One Level That Almost Always Occurs
  1.3 Incorrectly Preparing Or Failing To Prepare Continuous Predictors
    1.3.1 Extremely Skewed Predictors
    1.3.2 A Spike And A Distribution
    1.3.3 One Level That Almost Always Occurs
    1.3.4 Ignoring Or Misusing Time-Dependent Information
2 Defining Roles, Performing Sampling, And Defining Target Profiles
  2.1 Inappropriate Metadata
  2.2 Inadequate Or Excessive Input Data
  2.3 Inappropriate Or Missing Target Profile For Categorical Target
  2.4 Target Variable Event Levels Occurring In Different Proportions
  2.5 Differences In Misclassification Costs
3 Partitioning The Data
  3.1 Misunderstanding The Roles Of The Partitioned Data Sets
  3.2 Failing To Consider Changing The Default Partition
4 Choosing The Variables
  4.1 Failing To Evaluate The Variables Before Selection
  4.2 Using Only One Selection Method
  4.3 Misunderstanding Or Ignoring Variable Selection Options
    4.3.1 Choosing Settings In The χ² Mode
    4.3.2 Choosing Settings In The R² Mode
5 Replacing Missing Data
  5.1 Failing To Evaluate Imputation Method
  5.2 Overlooking Missing Value Indicators
6 Fitting Linear Regression Models
  6.1 Overusing Stepwise Regression
  6.2 Inaccurately Interpreting The Results
7 Fitting Decision Tree Models
  7.1 Ignoring Tree Instability
  7.2 Ignoring Tree Limitations
8 Fitting Neural Network Models
  8.1 Failing To Do Variable Selection
  8.2 Failing To Consider Neural Networks
9 Comparing Fitted Models
  9.1 Misinterpreting Lift
  9.2 Choosing The Wrong Assessment Statistic
10 Scoring New Data
  10.1 Generating Inefficient Score Code
  10.2 Ignoring The Model Performance
11 Clustering Your Data
  11.1 Building One Cluster Solution
  11.2 Including (Many) Categorical Variables
12 Performing Association And Sequence Analysis
  12.1 Failing To Sort The Data Set
  12.2 Failing To Manage The Number Of Outcomes
Conclusion
References
Acknowledgments
Contact Information

Tags: data mining (数据挖掘), mistakes and identification (错误与识别), ttnn BI Viewpoints (ttnn BI观点), SAS, Data mining, Doug Wielenga, Identifying and Overcoming Common Data Mining Mistakes