Statistical Significance Test for Neural Network Classification

Model selection in neural networks can be guided by statistical procedures such as hypothesis tests, information criteria, and cross validation. Taking a statistical perspective is especially important for nonparametric models like neural networks, because the reason for applying them is the lack of knowledge about an adequate functional form. Many researchers have developed model selection strategies for neural networks based on statistical concepts. In this paper, we focus on model evaluation by implementing a statistical significance test. We use the Wald test to evaluate the relevance of the parameters in networks for classification problems. Parameters with no significant influence on any of the network outputs have to be removed. In general, the results show that the Wald test works properly to determine the significance of each weight of the selected model. An empirical study using the Iris data shows that all parameters in the network are significant, except the bias at the first output neuron.


INTRODUCTION
One of the most unresolved questions in the literature on neural networks (NN) is what architecture should be used for a given problem. Architecture selection requires choosing both the appropriate number of hidden units and the connections among them (Sarle 1994).
A desirable network architecture contains as few hidden units and connections as necessary for a good approximation of the true function, taking into account the trade-off between estimation bias and variability due to estimation errors. It is therefore necessary to develop a methodology for selecting an appropriate network model for a given problem.
Reed (1993) provides a survey of the usual approaches pursued in the network literature, for example regularization, pruning, and stopped training. Regularization methods choose the network weights such that they minimize the network error function (e.g. the sum of squared errors) plus a penalty term for the network's complexity. Another way to justify the regularization term is to formalize and interpret the method in a Bayesian framework; this is reviewed, for example, in Bishop (1995) and Ripley (1996).
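As a concrete illustration of such a regularized objective, the sketch below adds a weight-decay penalty to the sum of squared errors. It is a minimal sketch, not code from the paper: `network_output` is a hypothetical placeholder for any feed-forward network, and the penalty strength `lam` is a free hyperparameter.

```python
import numpy as np

def regularized_error(w, X, y, network_output, lam=0.01):
    """Network error function (SSE) plus a penalty term for network complexity."""
    residuals = y - network_output(X, w)   # network_output is a placeholder
    sse = np.sum(residuals ** 2)           # e.g. sum of squared errors
    penalty = lam * np.sum(w ** 2)         # weight-decay penalty term
    return sse + penalty
```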
Pruning methods identify the parameters that do not 'significantly' contribute to the overall network performance. However, this 'significance' is usually not judged on the basis of test statistics (an exception is Burgess (1995), who uses F-tests to identify insignificant parts of a network model). Pruning methods use the so-called 'saliency' as a measure of a weight's importance. The saliency of a weight is defined as the increase in the network model error (e.g. the sum of squared errors) incurred by setting this weight to zero. The idea is to remove the weights with low saliency; however, the method does not provide any guidelines as to whether or not a saliency should be judged as low. Anders & Korn (1996) show that the computation of the saliency is generalized by a corresponding Wald test statistic. An alternative to conventional pruning methods is developed and tested in Kingdon (1997). The basic idea of the approach, called 'network regression pruning', is to remove network weights while retaining the network's mapping. A weight is seen as redundant if, after its removal, the original mapping can be (approximately) recovered by adjusting the remaining weights of the affected node.
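The saliency computation can be sketched directly: set one weight to zero at a time and record the increase in the error function. This is an illustrative sketch under the same hypothetical `network_output` placeholder, not the exact procedure of the cited papers.

```python
import numpy as np

def saliencies(w, X, y, network_output):
    """Saliency of each weight: the increase in SSE incurred by setting it to zero."""
    def sse(weights):
        return np.sum((y - network_output(X, weights)) ** 2)

    base_error = sse(w)
    result = np.empty_like(w)
    for j in range(len(w)):
        w_zeroed = w.copy()
        w_zeroed[j] = 0.0                  # set weight j to zero
        result[j] = sse(w_zeroed) - base_error
    return result                          # low saliency -> pruning candidate
```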
In the application of stopped training, the data set is split into a training set and a validation set. If the model errors on the validation set begin to grow during the training process, the training algorithm is stopped.

WALD TEST FOR NETWORK PARAMETERS
The Wald statistic allows the simplest analysis, although it may or may not be the easiest statistic to compute in a given situation. The motivation for the Wald statistic is that, when the null hypothesis $H_0\colon Sw^* = s$ is correct, $S\hat{w}_n$ should be close to $Sw^* = s$, so a value of $S\hat{w}_n - s$ far from zero is evidence against the null hypothesis. The theorem on the Wald statistic that is used for hypothesis testing of the parameters in an NN model is adapted from White (1999), Theorem 4.31. The result of the adaptation for our specific purpose is as follows.

Theorem 2.6. Let the conditions of Theorem 2.2 hold, i.e., $l(z, \cdot)$ is continuous on $W$, $l$ is dominated on $W$ by an integrable function $d$ with $E(d(Z_t)) < \infty$, and $w^*$ is interior to $W$. Let $S$ be a $q \times k$ matrix with full row rank $q$, and consider the null hypothesis $H_0\colon Sw^* = s$ for some $s \in \mathbb{R}^q$. Suppose that $\sqrt{n}(\hat{w}_n - w^*) \xrightarrow{d} N(0, V)$ and that there exists a positive semidefinite and symmetric estimator $\hat{V}_n$ such that $\hat{V}_n - V \xrightarrow{p} 0$ and, for all $n$ sufficiently large, $\det(\hat{V}_n) > 0$. Then, under $H_0$,
$$\mathcal{W}_n = n\,(S\hat{w}_n - s)'\,(S\hat{V}_n S')^{-1}\,(S\hat{w}_n - s) \xrightarrow{d} \chi^2_q .$$
To test the significance of a single network weight $w_j$, $S$ is taken as the $1 \times k$ selection row that picks out $w_j$ and $s = 0$, so that $H_0\colon w_j = 0$.
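A minimal sketch of the Wald statistic in Theorem 2.6, assuming the weight estimates `w_hat`, a consistent estimate `V_hat` of the asymptotic covariance of $\sqrt{n}(\hat{w}_n - w^*)$, and the matrix `S` are already available; how `V_hat` is obtained (e.g. from the Hessian of the log-likelihood) is not shown. The function name and arguments are hypothetical; the p-value uses the chi-squared distribution from SciPy.

```python
import numpy as np
from scipy.stats import chi2

def wald_test(w_hat, V_hat, S, s, n):
    """Wald statistic W_n = n (S w_hat - s)' (S V_hat S')^{-1} (S w_hat - s).

    Under H0: S w* = s, W_n is asymptotically chi-squared with
    q = row rank of S degrees of freedom.
    """
    diff = S @ w_hat - s
    middle = S @ V_hat @ S.T
    W_n = n * diff @ np.linalg.solve(middle, diff)
    q = S.shape[0]
    return W_n, chi2.sf(W_n, df=q)         # statistic and its p-value

# Testing the significance of a single weight j (H0: w_j = 0, q = 1):
#   S = np.zeros((1, k)); S[0, j] = 1.0
#   W_n, p = wald_test(w_hat, V_hat, S, np.zeros(1), n)
```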

The estimated weights and the significance of these estimators are presented in Table 1.

Table 1. Estimated weights of the NN (4-1-3) model and the Wald-test significance of each estimator.
Usually, the neurons in an NN are fully connected. By using this Wald test, it is possible to identify weights that are not significant. Based on Table 1, all of the weights are significant except the bias at the first output neuron. Therefore, that connection should be removed, and the best model is not fully connected. The resulting architecture is shown in Figure 1b.

Figure 1b. Architecture of the NN (4-1-3) with the estimated weight values.
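Once a p-value is available for every weight, the pruning decision described above (removing the non-significant bias of the first output neuron) amounts to masking the weights that fail the test. A hedged sketch with hypothetical names:

```python
import numpy as np

def prune_insignificant(w_hat, p_values, alpha=0.05):
    """Set weights with Wald-test p-value above alpha to zero (remove connection)."""
    significant = p_values < alpha
    return np.where(significant, w_hat, 0.0), significant
```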