THE SOLUTIONS OF PRACTICAL TAXONOMY PROBLEMS
The OTEKS software was transferred to more than 100 organisations of the
former USSR. Over the 30 years since our first taxonomy algorithms appeared,
we have also accumulated our own experience in applying them.
Below we present some results from solving applied problems.
Palaeontology and geological exploration problems.
The FOREL algorithm was developed in 1967 [1], and the first problem used
to test it came from palaeontology. Some animals that lived in different
geological eras had a solid chitinous shell, which left very clear imprints
in ancient rocks. Palaeontologists find such imprints and study them to
determine the species, genus and family of the animals that carried these
shells. The imprints are then used to date the layer of the earth's crust
in which they were found. One such fossil creature is the trilobite, an
extinct marine arthropod.
We received a table describing 150 trilobites by 30 properties, among them
the size of the chitinous shell, the number of furrows on the head, and so
on. The FOREL algorithm produced a different number of taxons for different
values of the sphere radius R. For a certain value of R we obtained the
same number of taxons that palaeontologists had established earlier by
classifying this trilobite collection manually. We built the taxonomic
hierarchy tree and, to our common surprise, the composition of the taxons
exactly matched the composition of the manually obtained classes.
The palaeontologists were especially pleased that one trilobite would not
join any taxon even at large values of R. It turned out to be a
«unique type from quite another family, and it is surprising and
wonderful that the computer guessed it!»
This success made such a powerful impression on our palaeontologist
colleagues that at one of our seminars we were asked: «Can the computer
guess the proper Latin name for one or another trilobite?» We had to say
that this is hardly possible without prompting.
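The dependence of the number of taxons on the radius R can be illustrated
with a minimal FOREL-style sketch. It is not the original OTEKS code: the
function name `forel`, the choice of the first unassigned point as the
starting centre, and the toy data are assumptions made for illustration.

```python
import math

def forel(points, radius):
    """Group points into taxons with a FOREL-style moving sphere (sketch)."""
    remaining = list(range(len(points)))
    taxons = []
    while remaining:
        # Assumption: start the sphere at the first point not yet assigned.
        center = points[remaining[0]]
        inside = None
        while True:
            new_inside = [i for i in remaining
                          if math.dist(points[i], center) <= radius]
            if new_inside == inside:
                break  # the sphere stopped moving
            inside = new_inside
            # Move the sphere to the centre of gravity of the points inside it.
            center = [sum(points[i][d] for i in inside) / len(inside)
                      for d in range(len(center))]
        taxons.append(inside)
        members = set(inside)
        remaining = [i for i in remaining if i not in members]
    return taxons

# Hypothetical toy data: two compact groups and one isolated object.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
small_r = forel(points, 2.0)    # the two groups and the isolated object
large_r = forel(points, 20.0)   # the groups merge, the isolated object stays alone
```

In this toy run the isolated object forms its own taxon at both radii,
which mirrors the behaviour of the «unique» trilobite described above.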
Another interesting geological problem was the taxonomy of territories
of Northeast Chukotka [2]. Geologists divided the region under study
into 1992 square cells of 10 by 10 km. Each cell was described
by 45 binary properties indicating the presence or absence of various
geological features: a mercury aureole, a depth of Mesozoic deposits of
3-4 km, a geosynclinal flexure, etc.
The taxonomy of these cells was performed with a modified FOREL
algorithm (FOREL-5), designed for objects described by binary properties.
From the various solutions the customers selected a variant containing
318 taxons. This variant attracted their attention because 46 of the 318
taxons included cells that had already been explored thoroughly and where
gold-bearing deposits had been found. So when planning expeditions in
search of gold, it is reasonable to give first priority to those
unexplored cells that also fall into these «gold» taxons. The results of
the expeditions confirmed the high efficiency of this method of planning
geological exploration. Recommendations for discovering deposits of other
minerals were developed in the same way.
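The selection rule described above can be sketched as follows. The function
name `promising_cells`, the cell identifiers and the toy data are
hypothetical; the taxon numbers are assumed to come from a taxonomy that has
already been computed.

```python
def promising_cells(taxon_of, gold_cells, explored_cells):
    """Return unexplored cells that share a taxon with a known gold-bearing cell."""
    gold_taxons = {taxon_of[c] for c in gold_cells}
    return sorted(
        c for c, t in taxon_of.items()
        if t in gold_taxons and c not in explored_cells
    )

# Hypothetical toy data: cell id -> taxon number.
taxon_of = {"A1": 0, "A2": 0, "B1": 1, "B2": 2, "C1": 2, "C2": 1}
explored_cells = {"A1", "B1", "B2"}
gold_cells = {"A1", "B2"}  # explored cells where gold was found

candidates = promising_cells(taxon_of, gold_cells, explored_cells)
```

Here cells A2 and C1 are returned, because they are unexplored and lie in
the same taxons as the gold-bearing cells A1 and B2.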
Sociology and economics problems.
To plan expeditions studying the social problems of the rural population
of the Altay region, sociologists needed to choose k villages for
investigation that would represent the different types of villages in the
region as fully as possible. For this purpose a description of all villages
in the region was prepared, containing such properties as population
size, number of schools and clubs, type of water and electricity supply,
availability of hard-surfaced roads, etc.
Using the FOREL algorithm, the set of several hundred villages was
divided into a fixed number k of taxons, and a typical representative
of each taxon was selected. This guaranteed that the k selected villages
represented the whole variety of villages in the region rather well, that
no type of village would be missed, and that resources would not be spent
on investigating «twin» villages.
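Choosing a typical representative of each taxon can be sketched as below.
The text does not say how «typical» was defined; this sketch assumes it
means the object closest to the centre of gravity of its taxon. The function
name and the toy data are hypothetical, and in practice the properties
would first be brought to comparable scales.

```python
import math

def typical_representatives(objects, labels):
    """For each taxon, pick the object closest to the taxon's centre of gravity."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    reps = {}
    for lab, members in groups.items():
        dim = len(objects[members[0]])
        centroid = [sum(objects[i][d] for i in members) / len(members)
                    for d in range(dim)]
        reps[lab] = min(members, key=lambda i: math.dist(objects[i], centroid))
    return reps

# Hypothetical toy data: (population, number of schools) for six villages
# that have already been split into two taxons.
villages = [(100, 1), (120, 1), (110, 1), (2000, 5), (2400, 6), (2200, 5)]
labels = [0, 0, 0, 1, 1, 1]

reps = typical_representatives(villages, labels)
```

With these data village 2 represents the small villages and village 5 the
large ones, so only two villages need to be visited instead of six.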
After the expeditions were completed, the sociologists brought back a
huge amount of material in the form of questionnaires filled in by
respondents. Taxonomy algorithms were used to process these data as well.
The problem of taxonomic analysis of such material, aimed at revealing the
causes of migration of the rural population to cities, was described
in the Introduction.
One more problem connected with population migration was solved on data
describing all regions, districts and autonomous republics of the
Russian Federation. Taxons were identified that contained
administrative units with positive, zero and negative migration
balance. Analysis of the taxon characteristics made it possible to
understand the relative significance of individual factors in migration
processes. For example, it was discovered that the level of salary is
significantly less important than the provision of state-funded housing
to the population. So if one had to decide where to invest financial
resources, housing construction deserved attention first.
These and other problems of sociological data analysis are presented in
[3,4].
Taxonomy methods have also found wide application in the statistical
analysis of economic data [5].
Biological problems.
When creating new varieties of plants and breeds of animals, breeders try
to select the most dissimilar specimens for crossing and to avoid crossing
«twins». To this end, the descriptions of all potential «parents» are
subjected to taxonomy, and «individuals» from different taxons are
selected for crossing. This problem is similar to the village-selection
problem from sociology described earlier.
Biophysicists study the impact of different influences on living organisms.
The first series of experiments was performed on a large number of species.
The resulting observation protocol was a table of more than 20 species,
each described by two groups of properties:
8 characteristics of the influence and 14 characteristics of the organism's
response. For every new combination of influence values a new combination
of organism responses was observed and recorded in the protocol as a
new «object-property» table.
Every such table was processed by the KRAB taxonomy algorithm, which makes
it possible to select automatically the best number of taxons k in a given
range kmin < k < kmax. The taxonomy was performed both for each group
of properties separately and in the full 22-dimensional space.
Comparison of the different tables showed that there are species of
living organisms that respond approximately identically to equal external
impacts and fall into the same taxon in different taxonomies.
One typical representative of each such stable taxon was selected for
more detailed experiments, which made it possible to accelerate the
investigation significantly and to reduce the cost of experiments.
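Finding objects that stay together across several taxonomies can be sketched
as follows. The function name `stable_groups` and the toy labels are
hypothetical. Taxon numbers are only compared within one taxonomy, so
different numbering in different tables does not matter.

```python
from collections import defaultdict

def stable_groups(taxonomies):
    """Return groups of objects that share a taxon in every taxonomy.

    taxonomies: list of label lists, one per table; labels[i] is the
    taxon number of object i in that table.
    """
    signature_to_objects = defaultdict(list)
    for i in range(len(taxonomies[0])):
        # The signature is the object's taxon number in each table in turn.
        signature = tuple(labels[i] for labels in taxonomies)
        signature_to_objects[signature].append(i)
    return [objs for objs in signature_to_objects.values() if len(objs) > 1]

# Hypothetical toy data: five species, three experiment tables.
taxonomies = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 1, 1],
]
groups = stable_groups(taxonomies)
```

Only species 0 and 1 fall into one taxon in all three tables, so one of
them could stand in for both in the more detailed experiments.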
Oceanology problems.
One oceanology problem involved the following data. At a given point on
the surface of the world ocean, the water temperature and salinity were
measured at 16 different depths.
The protocol contained the two co-ordinates of the point and 32
characteristics (16 temperatures and 16 salinity values). The total number
of investigated points in the world ocean was about 20000. Thus the
taxonomy had to be performed on a table of size 20000x34, and this was
done with the FOREL algorithm.
The authors of the data selected one of the taxonomy variants, with 15
taxons. When they coloured the points of each taxon with the same colour
on the map, zones with similar profiles of temperature and salinity
became visible. In particular, the well-known ocean currents (the Gulf
Stream, the Kuroshio and others) were clearly seen. Regularities in the
structure of the world ocean that are of interest to oceanologists were
discovered.
The problems of voice signal recognition («Code book»).
Speech recognition systems often use spectral characteristics measured
on short consecutive sections of the signal. Every section is represented
by a point in the n-dimensional space of spectral features, and a
word may be represented as a trajectory through these points.
After the training material has been accumulated, the feature space may
contain hundreds of thousands of points, and it is reasonable to store in
memory not all the points but the taxons that describe them. Taxonomy
methods are used to divide the points into k taxons, and all pairwise
distances between taxons are calculated. This matrix of pairwise
distances is called the «code book».
Each section of a pronounced word falls into the vicinity of one taxon or
another. If we record the numbers (codes) of the closest taxons,
the word may be represented by a sequence of such codes. After
training, the reference patterns (standards) of words are stored in
computer memory in the form of such sequences.
To recognise a test word, its code sequence is compared with all standards
and the most similar standard is chosen. The comparison uses dynamic
programming, which requires all distances between the codes of each
standard and the codes of the word to be recognised. The code book
makes it possible to simplify this labour-consuming stage significantly:
it is only necessary to specify the numbers of two codes, and the distance
between them is taken from the code book.
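The scheme above can be sketched in a few lines. This is an illustration
under assumptions, not the original system: taxons are represented by their
centres, the dynamic programming is a plain dynamic time warping (DTW)
recurrence, and all names and data are hypothetical.

```python
import math

def build_code_book(centers):
    """Matrix of pairwise distances between taxon centres (the «code book»)."""
    k = len(centers)
    return [[math.dist(centers[a], centers[b]) for b in range(k)]
            for a in range(k)]

def encode(frames, centers):
    """Replace every spectral frame by the number of its nearest taxon centre."""
    return [min(range(len(centers)), key=lambda c: math.dist(f, centers[c]))
            for f in frames]

def dtw_distance(codes_a, codes_b, code_book):
    """DTW distance between two code sequences, using table lookups only."""
    inf = float("inf")
    n, m = len(codes_a), len(codes_b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = code_book[codes_a[i - 1]][codes_b[j - 1]]  # no recomputation
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def recognise(word_codes, standards, code_book):
    """Return the name of the standard closest to the given code sequence."""
    return min(standards,
               key=lambda name: dtw_distance(word_codes, standards[name], code_book))

# Hypothetical toy data: three taxon centres in a 2-dimensional feature space.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
code_book = build_code_book(centers)
standards = {"yes": [0, 1, 1, 2], "no": [2, 2, 0, 0]}
frames = [(0.1, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.1)]
word_codes = encode(frames, centers)
answer = recognise(word_codes, standards, code_book)
```

The inner loop of `dtw_distance` never touches the spectral vectors
themselves; every local cost is a lookup by two code numbers.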
Other application areas.
Taxonomy algorithms are applied in soil science to classify soil types,
which is especially important now for developing the soil cadastre
ahead of the forthcoming privatisation of land in Russia.
Analysis of weather taxons obtained from three years of meteorological
observations in a forest area of the Krasnoyarsk region made it possible
to find a few taxons related to the days when fires occurred at the
observation site. There were taxons with 50 fire days, and there were
days (i.e. combinations of weather conditions) on which no fire was
observed. Together with weather forecasts, these data help to plan the
optimal distribution of fire-prevention resources.
Analysis of the psychological characteristics of Perm State University
students made it possible to identify groups of students with similar
characteristics. Such information may help in forming the optimal
composition of study groups, in selecting typical methods of psychological
correction for students, and so on.
Some remarks about taxonomy.
An inexperienced user usually asks: does an «objective» or «natural»
taxonomy exist, or is taxonomy always «subjective»?
The answer is that every taxonomy and every classification
contains both subjective and objective elements. This is well demonstrated
by an example from the book of M. Bongard [6], shown in Fig. 2.
Fig. 2.
There are six figures, which may be divided in different ways and into
different numbers of taxons. If we pay attention to colour,
we find two taxons: white and shaded figures. If we count
the number of angles, we find three taxons: figures with three,
four and an infinite number of angles. If we look at the area
of the figures, we can find two taxons (large and small
figures) or three taxons (large, medium and small figures).
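The same point can be shown in a short sketch: one set of objects yields
different taxonomies depending on the property chosen. The six figures
below are hypothetical stand-ins, not the actual figures of Fig. 2, and the
area thresholds are arbitrary assumptions.

```python
from collections import defaultdict

def taxonomy_by(objects, key):
    """Group object names by the value that `key` extracts from their properties."""
    groups = defaultdict(list)
    for name, props in objects.items():
        groups[key(props)].append(name)
    return dict(groups)

# Hypothetical property values; a circle is given an infinite number of angles.
figures = {
    "F1": {"colour": "white", "angles": 3, "area": 9.0},
    "F2": {"colour": "shaded", "angles": 3, "area": 2.0},
    "F3": {"colour": "white", "angles": 4, "area": 8.0},
    "F4": {"colour": "shaded", "angles": 4, "area": 2.5},
    "F5": {"colour": "white", "angles": float("inf"), "area": 9.5},
    "F6": {"colour": "shaded", "angles": float("inf"), "area": 5.0},
}

by_colour = taxonomy_by(figures, lambda p: p["colour"])
by_angles = taxonomy_by(figures, lambda p: p["angles"])
# The thresholds 5.0, 8.0 and 4.0 are arbitrary choices, themselves subjective.
by_size2 = taxonomy_by(figures, lambda p: "large" if p["area"] >= 5.0 else "small")
by_size3 = taxonomy_by(
    figures,
    lambda p: "large" if p["area"] >= 8.0 else "medium" if p["area"] >= 4.0 else "small",
)
```

Each call is a perfectly valid taxonomy of the same six objects; which one
is «right» depends only on the property and thresholds the user picks.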
Thus it is clear that a single, «most natural», «absolutely objective»
taxonomy does not exist. All real objects have an infinite number of
properties, and the selection of a finite subset of these
properties is a subjective act. Measures of closeness and quality
criteria are also chosen subjectively. If the taxonomy is performed to
reach a known aim (i.e. when such a «super-aim» exists), then the
quality of the taxonomy is checked by asking: does it lead to this aim,
is it convenient, is it economical, etc. Such checking is objective, but
the selection of the «super-aim» is again subjective, and a
given taxonomy will be good for one super-aim and bad for another.
Sometimes one hears the following opinion: «The taxonomy algorithm
gave a bad result: one big taxon was found, three smaller ones,
and all the other points are in single-point taxons». The algorithm is
not always to blame for such a result. One can encounter data generated by
one uniform process and described by a standard distribution law, and
no taxonomy algorithm can divide such a set into 5 or 7 «independent»
taxons. In that case one can console oneself that taxonomy makes it
possible not only to reveal the structure of a well-structured set, but
also to show that a set is homogeneous and does not split into
isolated subsets. Often that is exactly what we wanted to know.
There are also situations like this: «I am not satisfied with this taxonomy.
One taxon is good, it really contains objects of a similar nature.
All the others are quite mixed». Yes, taxonomy cannot exclude such a result,
and the reason may be the poor quality of the algorithm, but it may also be
explained by an unsuccessful selection of the properties describing the
objects. The characteristics may turn out to be uninformative from the
point of view of the super-aim that the user has intuitively formulated.
Thus taxonomy algorithms can help to examine the informativeness of the
available properties.
By the way, if the user has partial information, i.e. knows
which objects must be in one taxon and which in different ones, then
such information can be usefully applied.
For the same object properties the taxonomy results may differ if
we take into account their relative weights («importance»).
In calculating the distance between objects p and q, the contribution
of feature xj must be proportional to its weight coefficient wj,
so the Euclidean distance r(pq) in n-dimensional space will be defined as follows:
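One form consistent with this description is given here as an assumption:
the weight wj multiplies the squared difference of feature j under the
root (it could equally enter squared), and x_{pj}, x_{qj} denote the values
of feature j for objects p and q.

```latex
r(p,q) = \sqrt{\sum_{j=1}^{n} w_j \,\bigl(x_{pj} - x_{qj}\bigr)^{2}}
```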
The values of the weights wj may be established in advance, but sometimes
the problem consists precisely in finding the relative importance of the
different properties. If the desired taxonomy is known, then by solving
the inverse problem it is possible to find a combination of weights wj
that yields exactly this taxonomy.
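How the weights change which objects count as close can be seen in a small
sketch. It assumes a weighted Euclidean distance in which each weight
multiplies the squared difference of its feature; the function names and
the data are hypothetical.

```python
import math

def weighted_distance(p, q, w):
    """Weighted Euclidean distance (assumed form: weight times squared difference)."""
    return math.sqrt(sum(wj * (pj - qj) ** 2 for pj, qj, wj in zip(p, q, w)))

def nearest(target, others, w):
    """Name of the object in `others` closest to `target` under weights `w`."""
    return min(others, key=lambda name: weighted_distance(target, others[name], w))

# Hypothetical objects described by two features.
target = (0.0, 0.0)
others = {"a": (1.0, 3.0), "b": (3.0, 1.0)}

first_feature_important = nearest(target, others, (1.0, 0.1))
second_feature_important = nearest(target, others, (0.1, 1.0))
```

With the first feature weighted heavily, object «a» is the nearest; with the
second feature weighted heavily, object «b» is. A taxonomy built on such
distances changes accordingly.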
Many years of experience in applying taxonomy algorithms have shown
that taxonomic data analysis is a powerful tool for understanding
regularities in the objects or phenomena under investigation.
REFERENCES
1. Elkin E.A., Elkina V.N., Zagoruiko N.G. The Possibility of Application
of Taxonomy Methods in Paleontology. // Journal "Geology and Geophysics",
Vol. 9. Novosibirsk, 1967.
2. Elkina V.N., Zagoruiko N.G., Kuklin A.P. Types of the gold-bearing
territories of Chuckchee plicate province. J. «Kolyma», N4, Magadan, 1974,
p.41-45.
3. Zagoruiko N.G., Zaslavskaia T.I. Pattern Recognition in Social Research.
Novosibirsk, Nauka, 1968.
4. Zagoruiko N.G., Zaslavskaia T.I. On possibility of pattern recognition
methods utilisation in sociological research. Int. J. "Quality and
Quantity", v. IV (1970), n.2, pp. 365-374.
5. Elkina V.N., Zagoruiko N.G., Novoselov Yu.A. Mathematical Methods of
Agroinformatics. Proc. Institute of Mathematics, Siberian Branch of the
USSR Academy of Sciences, Novosibirsk, 1987 (in Russian).
6. Bongard M.M. The Problem of Recognition. Moscow, Nauka, 1967 (in Russian).