|
BASIC
HYPOTHESES FOR DATA ANALYSIS METHODS
Rigorous mathematical methods for constructing statistical
decision functions have been developed for the cases where
all information about the distributions of general populations of
patterns needed to solve a problem is available.
Unfortunately,
such conditions are not met with in solving real pattern
recognition problems. It is necessary to complement the available
objective information by a number of subjectively chosen assumptions or
hypotheses. The paper presents two basic hypotheses - those of "compactness"
and "l-compactness"
- and shows their influence on the character of data analysis algorithms.
The hypothesis of compactness (H) consists in the following : realisations
of one and the same pattern are commonly represented in the space of features
as closely spaced points, forming «compact» clusters [1]. Though
the above hypothesis seems trivial and easy to rule out, it forms the basis
of the majority of recognition algorithms and the algorithms of other data
analysis problems.
Let us call n features contained in informative subset X "describing",
and nominal (n+1) feature z specifying the name of the pattern - "goal".
Denote the set of objects of the training sample by A, a new object to
be recognised - by q, and the fact that the objects of the set A are "compact"
("equivalent", "similar" or "close" to one another) in the space of n characteristics
X - by C(A/X). The measure of "compactness" is chosen arbitrarily. For
example, one can consider objects compact, if the Euclidean distance between
vectors of their features does not exceed the value r.
In fact, the
hypothesis H is equivalent to the assumption that there exists a regular
relation between features X and z, and its test algorithm may be represented
as follows: if {C(A/X,z) & C(A,q/X)}, then C(A,q/z). This means that
if objects of the set A are compact in the space X,z, and objects of the
set A,q are compact in the space of describing properties X, then objects
A and q will be compact in the space of goal feature z as well.
The formulated compactness condition for pattern recognition problem is
necessary but not sufficient. It is required that clusters of points of
different patterns should not overlap one another. This well be denoted
by C'(A,B/X). In view of the foregoing, the compactness hypothesis
H for pattern recognition may be written as
if {C'(A,B/X)&C(A/X,z)&C(A,q/X)}
then C(A,q/z).
There are different versions of compactness hypothesis: hypotheses
of "unimodal" (Hu), "polymodal" (Hp) and "local" (Hl) compactness. Projections
of compact clusters on coordinate axes will be compact as well, so another
three variants of compactness hypothesis may be distinguished: hypotheses
of "unimodal projective" (Hup), "polymodal projective" (Hpp) and "local
projective" (Hlp) compactness.
Hypothesis
of unimodal compactness forms the basis of numerous taxonomy
algorithms
that give taxons of simple geometric form: hyperspheres or hyperparallelepipeds.
For example, algorithms of FOREL
family [2] form spherical taxons of a given radius R.
If the
system contains "noninformative" features, they "wash away" compactness,
and make correct decision making difficult. The problem of selecting informative
features consists in searching for the subspace wherein the
characteristic of compactness attains its maximum value.
When the hypothesis of local compactness is used, the number of datum points
or "precedents" needed for reliable recognition of the training set may
serve as the informatively criterion of the subsystem of features.
Real
"object-property" tables can contain gaps: values of some characteristics
of objects are not given. The hypothesis of polymodal compactness is employed
to predict the values of the missing
elements [2].
The hypothesis of compactness uses the values of the distances between
vectors in the space of characteristics. However, some examples demonstrate
that it is the relations between the distances (not the distances themselves)
that are most important in data analysis problems. Formulation of the hypothesis
of L-compactness rests upon the concept of l-distance.
If the distances between all pairs of points of the set A are determined,
then one can construct a graph without loops that joins all m points and
has a minimum total length of its edges. Such a graph is called the Shortest
Unclosed Way (SUW). If we single out some edge a and the edge bmin (the
shortest one among the edges adjacent to a), then the relation l=a/bmin
will characterize the measure of local inhomogeneity of distances or l-distance
between the neighbouring points.
Thus by analogy with the hypothesis of compactness, the hypothesis of l-compactness
(lH)
may be formulated as follows: realisations of one and the same pattern
are commonly represented in the space of features as l-close
points forming "l-compact"
clusters [3]. Similarly to the hypothesis of compactness, one can distinguish
hypotheses of unimodal (lHu),
polymodal (lHp)
and local (lHl)
compactness, as well as all their projective versions: (lHup),
(lHpp)
and (lHlp).
Algorithm l-KRAB
for
solving
taxonomy problems based
on l-compactness
hypothesis is presented in OTEKS.
REFERENCES
1. A.G.Arkadiev,
E.M.Braverman. Teaching a computer pattern recognition. - M.: Nauka, 1964.
2. N.G.Zagoruiko,
V.N.Yolkina, S.V.Emelyanov, G.S.Lbov. OTEKS applied program package. -
M.: Finansy i Statistika, 1986.
3.N.G.Zagoruiko. Hipotheses
of compactness and l-compactness
in data analysis methods.// Siberin Journal of Industrial Mathematics.
Vol.1, N1, Novosibirsk, 1988.- pp. 114-126. |