|
WANGA family algorithms [1] are proposed for filling the empty cells in
the table, containing data of different
types [2]. Let us start from the description of the algorithms
for the table, where all n features are measured in one scale of relations.
Let A is the table with missed element a(ij). All other elements of the
table are known. To select the competent subtable with the size of s strings
and q columns we will use the following procedure. Let us select the element
a(lk). The intersections of i-th and l-th strings with j-th and k-th columns
contain four table elements: a(lk), a(lj), a(ik) and unknown element a(ij).
If the properties are connected by strong direct regularity, then the relation
of two elements of k-th column will be the same or almost the same, as
the relation of two elements of j-th column. From the assumed equity of
relations a(lk) :a(ik) = a(lj) :a(ij) it is possible to obtain the a’ variant
of the estimation («prompt») of the element to be filled: a'lk
= a(lj)*a(ik):a(lk). Repeating this procedure for all other table elements,
we will receive (n-1)*(m-1) of prompt variants: a'11, a'12,..., a'lk,...,
a'(n-1)(m-1).
If the data in the table A are measured in the scale of intervals, then the minimum «prompting» element in the WANGA-I algorithm will be the submatrix, consisting of six elements situated at the intersection of two columns (j-th and k-th) and three strings (i-th, l-th and t-th). Invariant of interval scale is the relation of differences of any two couples of elements from the same column. Basing on this and on the hypothesis of direct connection between j-th and k-th columns, it is possible to write: 2 G = C * (n-1). (m-1) Let us select all prompts, obtained with participation of l-th string elements (their number will be (m-2)*(n-1)) and find their dispersion Dl and deriving from it competence of l-th string Ll = 1/(Dl+1). We will select s most competent strings on the base of such estimations. Let us find in the similar way also q most competent columns, forming, as a result, the competent submatrix s*q. Competence of each element of this submatrix is L(lk) = Ll*Lk. The prompts from the elements of this submatrix will be found by the prescribed method and the final variant of missing element filling will be obtained by averaging of prompts according to their weights L(lk). The confidence to this result may be estimated by the value P = 1:(D+1), where D is dispersion of all prompts of competent submatrix. If all properties in the table A are measured in the scale of orders, then the empty cells in it are filled by WANGA-P algorithm. Let us transform all columns, reducing their values to the scale of normalised ranks. The judgements of the following kind are invariant to scale transformations: if a(lk)>a(ik), then a(lj)>a(ij). From here we receive one of the three possible «prompt» variants: m H = - S pv*log pv , where pv is the fraction of «votes» for rank v. v=1 The competence of k-th column is taken equal to Lk = 1 / (H+1). In the case of «unanimous» voting for the same rank the positive value Lk will reach the maximum meaning equal to 1. If in the process of work with all columns we will accumulate the prompts, which were received under participation of l-th string elements, then the prescribed method will give the estimation of its competence Ll. Using these estimations it is possible to select the competent submatrix A’, containing s strings and q columns. Competence of each element a(lk) of this subtable will be taken equal to L(lk)=Li*Lk. Let us repeat the definition of prompts with participation of all s*q elements from A’. Now to obtain the overall results we will add into accumulators the corresponding fractions not of the unit, but of the L(lk) value. Distribution of votes between ranks helps to find the final solution: the missed parameter receives the value of the rank, which received the maximum number of votes. Entropy of the obtained votes distribution H will permit us to estimate the confidence degree to this result P = 1/ (H+1). The table A, where the properties are measured in the scale of names or titles, is considered. The name of the missed element a(ij) may be one of the d names (1, 2,..., f,..., d), contained in j-th column, or the new (d+1)-th name x. Let us look through all prompting fours of elements, situated at the intersections of i-th and l-th strings and j-th and k-th columns. We will be finding a prompt, starting from the following assumption: if i-th and l-th elements of k-th column have the same name, then the j-th column element must have identical names in the i-th and l-th strings, i.e. Every element of the subtable gives its own prompt, which is taken into account in the counter of «for» and «against» votes with the corresponding competence weight. If the wf values will be negative for all names, then it may be concluded, that the missed name is not present among d existing names. The name with the maximum value of w is selected from the names with positive w and this name is inserted into the empty cell. The confidence degree to the obtained decision may be the value, found via the entropy of votes distribution, as it was described above. If the initial table has more than one empty cells, then it is possible to predict each missed element independently from others («parallel» strategy) or in turn, using all elements, both initial and predicted at the previous steps ones («consecutive» strategy). If the consecutive strategy is applied, the prediction should be started from the element, providing maximum confidence degree. 1. Zagoruiko N.G., Elkina V.N., Lbov G.S. The algorithms for discovering of empirical regularities. Published by «Nauka», Novosibirsk, 1985 – 110 p. (in Russian) 2. Supes P., Zines G. The basis of measurement theory/ Psychological measurements – «Mir», Moscow, 1967, pp.9-100. |