BASIC HYPOTHESES FOR DATA ANALYSIS METHODS

     Rigorous mathematical methods for constructing statistical decision functions have been developed for the cases where  all  information about the distributions of general populations of patterns needed to solve a problem is available. Unfortunately, such conditions are not met with in solving real pattern recognition problems. It is necessary to complement the available objective information by a number of subjectively chosen assumptions or hypotheses. The paper presents two basic hypotheses - those of "compactness" and "l-compactness" - and shows their influence on the character of data analysis algorithms.
     The hypothesis of compactness (H) consists in the following : realisations of one and the same pattern are commonly represented in the space of features as closely spaced points, forming «compact» clusters [1]. Though the above hypothesis seems trivial and easy to rule out, it forms the basis of the majority of recognition algorithms and the algorithms of other data analysis problems.
     Let us call n features contained in informative subset X "describing", and nominal (n+1) feature z specifying the name of the pattern - "goal". Denote the set of objects of the training sample by A, a new object to be recognised - by q, and the fact that the objects of the set A are "compact" ("equivalent", "similar" or "close" to one another) in the space of n characteristics X - by C(A/X). The measure of "compactness" is chosen arbitrarily. For example, one can consider objects compact, if the Euclidean distance between vectors of their features does not exceed the value r.
   In fact, the hypothesis H is equivalent to the assumption that there exists a regular relation between features X and z, and its test algorithm may be represented as follows: if {C(A/X,z) & C(A,q/X)}, then C(A,q/z). This means that if objects of the set A are compact in the space X,z, and objects of the set A,q are compact in the space of describing properties X, then objects A and q will be compact in the space of goal feature z as well.
     The formulated compactness condition for pattern recognition problem is necessary but not sufficient. It is required that clusters of points of different patterns should not overlap one another. This well be denoted by  C'(A,B/X). In view of the foregoing, the compactness hypothesis H for pattern recognition may be written as 

if {C'(A,B/X)&C(A/X,z)&C(A,q/X)} then C(A,q/z).

     There are  different versions of compactness hypothesis: hypotheses of "unimodal" (Hu), "polymodal" (Hp) and "local" (Hl) compactness. Projections of compact clusters on coordinate axes will be compact as well, so another three variants of compactness hypothesis may be distinguished: hypotheses of "unimodal projective" (Hup), "polymodal projective" (Hpp) and "local projective" (Hlp) compactness.
    Hypothesis of unimodal compactness forms the basis of numerous taxonomy algorithms that give taxons of simple geometric form: hyperspheres or hyperparallelepipeds. For example, algorithms of FOREL family [2] form spherical taxons of a given radius R.
    If the system contains "noninformative" features, they "wash away" compactness, and make correct decision making difficult. The problem of selecting informative features consists in searching for the subspace wherein the characteristic of compactness attains its maximum value.
     When the hypothesis of local compactness is used, the number of datum points or "precedents" needed for reliable recognition of the training set may serve as the informatively criterion of the subsystem of features.
    Real "object-property" tables can contain gaps: values of some characteristics of objects are not given. The hypothesis of polymodal compactness is employed to predict the values of the missing elements [2].
     The hypothesis of compactness uses the values of the distances between vectors in the space of characteristics. However, some examples demonstrate that it is the relations between the distances (not the distances themselves) that are most important in data analysis problems. Formulation of the hypothesis of L-compactness rests upon the concept of l-distance. If the distances between all pairs of points of the set A are determined, then one can construct a graph without loops that joins all m points and has a minimum total length of its edges. Such a graph is called the Shortest Unclosed Way (SUW). If we single out some edge a and the edge bmin (the shortest one among the edges adjacent to a), then the relation l=a/bmin will characterize the measure of local inhomogeneity of distances or l-distance between the neighbouring points.
     Thus by analogy with the hypothesis of compactness, the hypothesis of l-compactness (lH) may be formulated as follows: realisations of one and the same pattern are commonly represented in the space of features as l-close points forming "l-compact" clusters [3]. Similarly to the hypothesis of compactness, one can distinguish hypotheses of unimodal (lHu), polymodal (lHp) and local (lHl) compactness, as well as all their projective versions: (lHup), (lHpp) and (lHlp).
     Algorithm  l-KRAB for solving taxonomy problems  based on l-compactness hypothesis is presented in OTEKS.


REFERENCES

1. A.G.Arkadiev, E.M.Braverman. Teaching a computer pattern recognition. - M.: Nauka, 1964.
2. N.G.Zagoruiko, V.N.Yolkina, S.V.Emelyanov, G.S.Lbov. OTEKS applied program package. - M.: Finansy i Statistika, 1986.
3.N.G.Zagoruiko. Hipotheses of compactness and l-compactness in data analysis methods.// Siberin Journal of Industrial Mathematics. Vol.1, N1, Novosibirsk, 1988.- pp. 114-126.