Caltech Library System
Electronic Theses

Nicholson, Alexander (2002-05-16) Generalization error estimates and training data valuation. http://resolver.caltech.edu/CaltechETD:etd-09062005-083717


Type of Document Dissertation
Author Nicholson, Alexander
Author's Email Address zander AT fantastivision.com
URN etd-09062005-083717
Persistent URL http://resolver.caltech.edu/CaltechETD:etd-09062005-083717
Title Generalization error estimates and training data valuation
Degree PhD
Option Computer Science
Advisory Committee
  Advisor Name          Title
  Yaser S. Abu-Mostafa  Committee Chair
Keywords
  • none
Date of Defense 2002-05-16
Availability unrestricted
Abstract
This thesis addresses several problems related to generalization in machine learning systems. We introduce a theoretical framework for studying learning and generalization. Within this framework, a closed form is derived for the expected generalization error that estimates the out-of-sample performance in terms of the in-sample performance. We consider the problem of overfitting and show that, using a simple exhaustive learning algorithm, overfitting does not occur. These results do not assume a particular form of the target function, input distribution, or learning model, and hold even with noisy data sets. We apply our analysis to practical learning systems, illustrate how it may be used to estimate out-of-sample errors in practice, and demonstrate that the resulting estimates improve upon errors estimated with a validation set for real-world problems.

Based on this study of generalization, we develop a technique for quantitative valuation of training data. We demonstrate that this valuation may be used to select training sets that improve generalization performance. With a reasonable prior over target functions, it further allows us to estimate the level of noise in a data set and provides for detection and correction of noise in individual examples. Finally, this data valuation can be used to classify new examples, yielding a new learning algorithm that is shown to be relatively robust to noise.

Files
  Filename              Size     Approximate Download Time (Hours:Minutes:Seconds)
                                 28.8 Modem  56K Modem  ISDN (64 Kb)  ISDN (128 Kb)  Higher-speed Access
  Nicholson_a_2002.pdf  3.50 Mb  00:16:12    00:08:20   00:07:17      00:03:38       00:00:18
