|
The ZET algorithm [1,2] is based on three assumptions. The first (the
redundancy hypothesis) is that real tables are redundant: they contain
similar objects (rows) and interrelated properties (columns). Of all the
possible kinds of dependency between columns (or rows), the ZET algorithm
uses only linear ones. If there is no redundancy (as, for example, in a
table of random numbers), then there is no basis for preferring one
prediction to another.
The second assumption (the similarity hypothesis, which follows from the compactness hypothesis) is that if a pair of objects is close in the values of n-1 properties, then it will also be close in the value of the n-th property. The third assumption (the hypothesis of local competence) is that redundancy is local: every object has its own subset of analogous objects, and every property has its own subset of analogous properties. If so, predicting the value of an element bij does not require the information in every row (string) other than the i-th and every column other than the j-th. Only the so-called «competent» rows, selected separately for each predicted element, should take part in the prediction.

The operation of the ZET algorithm can be divided into three stages:
1. For the given empty cell of the initial «object-property» table, whose columns are normalised by variance, a subset of «competent» rows is selected, and then a subset of «competent» columns for these rows.
2. The parameters of the formula used to predict the missing element are determined automatically so as to minimise the expected prediction error.
3. The element itself is predicted with this formula.

The «competence» of the l-th row with respect to the i-th row is a value L(il) inversely proportional to the distance between these rows. The «competence» L(jk) of the k-th column with respect to the j-th column is proportional to the absolute value of the correlation coefficient between them. Following the user's instructions, the program selects a submatrix of any size from 2*2 to n*m. Usually a submatrix of 3 to 7 rows and columns is used.
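The selection of competent rows and columns described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the Euclidean distance over the cells known in both rows is an assumed choice of distance, missing cells are assumed to be encoded as `None`, and the table is assumed to be already normalised by variance.

```python
import math


def row_competence(table, i, l, eps=1e-9):
    """Competence L(il): inverse of the distance between rows i and l.

    Assumption: Euclidean distance over the columns known in both rows;
    eps only guards against division by zero for identical rows.
    """
    pairs = [(a, b) for a, b in zip(table[i], table[l])
             if a is not None and b is not None]
    if not pairs:
        return 0.0
    dist = math.sqrt(sum((a - b) ** 2 for a, b in pairs))
    return 1.0 / (dist + eps)


def column_competence(table, rows, j, k):
    """Competence L(jk): absolute Pearson correlation of columns j and k,
    computed over the given rows where both values are known."""
    pairs = [(table[r][j], table[r][k]) for r in rows
             if table[r][j] is not None and table[r][k] is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return abs(sxy) / math.sqrt(sxx * syy)


def select_competent(table, i, j, n_rows=3, n_cols=3):
    """Pick the most competent rows for row i, then the most competent
    columns for column j within those rows (most competent first)."""
    other_rows = [l for l in range(len(table)) if l != i]
    rows = sorted(other_rows,
                  key=lambda l: row_competence(table, i, l),
                  reverse=True)[:n_rows]
    other_cols = [k for k in range(len(table[0])) if k != j]
    cols = sorted(other_cols,
                  key=lambda k: column_competence(table, rows, j, k),
                  reverse=True)[:n_cols]
    return rows, cols


# Illustrative data only: cell (0, 3) is the gap to be filled.
table = [
    [1.0, 2.0, 3.0, None],
    [1.1, 2.0, 2.8, 4.0],
    [0.9, 2.2, 3.1, 3.0],
    [1.2, 1.7, 3.0, 5.0],
    [9.0, 9.0, 9.0, 9.0],
    [5.0, 1.0, 0.0, 2.0],
]
rows, cols = select_competent(table, i=0, j=3, n_rows=3, n_cols=2)
print(rows, cols)
```

In this toy table the three rows nearest to row 0 are selected, and column 1 ranks above column 0 even though its correlation with column 3 is negative, because competence depends only on the absolute value of the correlation coefficient.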
When the value of an empty cell is predicted, «prompts» b(k) are generated from the dependencies between the j-th column and each of the other (k-th) columns. Each prompt is obtained from the linear regression equation between the j-th and k-th columns. If the submatrix contains q+1 columns, the q prompts are then averaged with weights proportional to the competence of the corresponding columns. The result is the predicted value b(j), derived from the redundancy of the columns. Filling the gap from the dependencies between the i-th row (string) and all other (l-th) rows (1, 2, ..., l, ..., s) follows the same procedure with the analogous formula.

To estimate the expected prediction error, the variance (dis) of the prompt values b(k) and b(l), obtained from all k columns and l rows of the competent submatrix, is calculated. A high variance indicates that there is no stable, regular relationship between element ij and the other elements of the submatrix. Under these conditions one clearly cannot rely on an accurate prediction of bij. Experiments showed that the correlation coefficient between the variance dis and the prediction error dij reaches +0.7.

Numerous modifications of the basic ZET algorithm have been made for different applied problems; they differ in purpose and in the set of operating modes. The gap-filling programs can work in one of the following modes:
1. Filling all empty cells.
2. Filling only those cells whose expected error does not exceed a fixed value.
3. Filling gaps using only the information present in the initial table.
4. Filling each successive cell using both the initial information and the values predicted for previously filled cells.
For each of these variants there are a few modes of printing intermediate and final results. The family of programs based on the ZET algorithm is used to solve various applied problems of data analysis.

1. Zagoruiko N.G., Elkina V.N., Timerkaev V.S. Algorithm for filling of empty cells in empirical tables (ZET algorithm). «Computing systems», Novosibirsk, 1975, v.61 – Empirical prediction and pattern recognition.
2. Zagoruiko N.G., Elkina V.N., Lbov G.S., Emelianov S.V. OTEKS applied software. «Finances and statistics», Moscow, 1986. 157 p. |