CLASSIFICATION OF DATA ANALYSIS PROBLEMS [1]
Let us consider the problems of predicting elements of a two-dimensional
table of the «object-property» type, where rows ai (i=1,2,...,m)
describe m objects and columns xj (j=1,2,...,n) correspond to n properties
(characteristics) of these objects.
The elements to be predicted (b) may be positioned in different ways.
Depending on their position, let us distinguish three FAMILIES of problems:
1) All elements bi0 are situated in one column;
2) All elements bj0 are situated in one row;
3) Elements bij0 belong to different rows and columns.
In every such family we will distinguish CLASSES of problems depending on
the number (q) of elements to be predicted. According to this classification,
the first family has three classes of problems:
1.1) One element is predicted (q=1);
1.2) Several elements are predicted at once (1 < q < m);
1.3) All elements of the column are predicted at once (q=m).
In a similar way, let us distinguish the classes of problems in the second
family:
2.1) q = 1;
2.2) 1 < q < n;
2.3) q = n.
In the third family there are two classes of problems, since elements that
belong to different rows and columns number at least two:
3.2) 1 < q < m*n;
3.3) q = m*n.
In each of these eight classes of problems we will distinguish TYPES of
problems according to the scales used to measure the values of the elements
to be predicted. We will distinguish three groups of scales: names (N), order
(P) and «quantitative» (K). The situation in which elements of
different types are predicted will be marked by the symbol (R).
The described classification is presented in Table 1.
Table 1
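The family, class and type rules described above can be sketched in code. This
is only an illustrative sketch, not part of the package: the function name
classify_problem, the 0-based (row, column) cell coordinates and the scales
mapping are hypothetical. A single element lies in one column and in one row
at the same time, so the sketch assumes it belongs to the first family.

```python
def classify_problem(m, n, cells, scales):
    """Return a label such as '1.3.N' for a set of elements to be predicted.

    m, n   -- number of objects (rows) and properties (columns)
    cells  -- iterable of (row, column) positions to be predicted
    scales -- dict mapping each position to 'N', 'P' or 'K'
    """
    cells = set(cells)
    q = len(cells)
    if q == 0:
        raise ValueError("there is nothing to predict")

    rows = {i for i, _ in cells}
    cols = {j for _, j in cells}
    if len(cols) == 1:        # family 1: all elements in one column
        family, limit = 1, m  # (assumption: a single element also lands here)
    elif len(rows) == 1:      # family 2: all elements in one row
        family, limit = 2, n
    else:                     # family 3: different rows and columns
        family, limit = 3, m * n

    if q == 1:
        cls = 1
    elif q < limit:
        cls = 2
    else:
        cls = 3

    kinds = {scales[c] for c in cells}
    if not kinds <= {"N", "P", "K"}:
        raise ValueError("scales must be 'N', 'P' or 'K'")
    typ = kinds.pop() if len(kinds) == 1 else "R"  # R: different types
    return f"{family}.{cls}.{typ}"


# Whole column of a 5 x 3 table, measured in the scale of names.
column = [(i, 0) for i in range(5)]
print(classify_problem(5, 3, column, {c: "N" for c in column}))  # 1.3.N
```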
Let us give some examples of the most common types of prediction problems.
The problem 1.1.N consists in predicting one element of a column measured in
the scale of names. It is the usual problem of pattern recognition: to
indicate the name of the pattern (class) to which some new object b belongs
(to determine the type of disease, to predict the presence or absence of oil,
etc.).
In the problem 1.1.P all objects are ordered according to the target property
x0, and it is necessary to find the position of the new object b in this
order (for example, to predict that the oil capacity of deposit b is higher
than that of the i-th deposit ai but lower than that of the (i+1)-th deposit
ai+1).
In the case 1.1.K it is necessary to give a quantitative estimate of the
property x0 of the object b (e.g., to predict the oil reserves in millions of
tonnes). If the objects in the table are ordered in time, then the 1.1.K
problem makes it possible to predict the values of object properties in the
future.
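To make the difference between types 1.1.N and 1.1.K concrete, here is a
minimal sketch that predicts the target property x0 of a new object b from its
nearest neighbours in the table. The nearest-neighbour rule is chosen only
because it is short; it is an assumption of this example and is not claimed to
be the method used in the package. The function names and the toy data are
hypothetical.

```python
from collections import Counter
from math import dist


def nearest_targets(table, b, k):
    """Targets x0 of the k objects closest to b (Euclidean distance).

    table -- list of (properties, x0) pairs, one pair per known object
    b     -- properties of the new object
    """
    ranked = sorted(table, key=lambda row: dist(row[0], b))
    return [x0 for _, x0 in ranked[:k]]


def predict_name(table, b, k=3):
    """Type 1.1.N: the most frequent name among the k nearest objects."""
    return Counter(nearest_targets(table, b, k)).most_common(1)[0][0]


def predict_quantity(table, b, k=3):
    """Type 1.1.K: the mean value of x0 among the k nearest objects."""
    values = nearest_targets(table, b, k)
    return sum(values) / len(values)


# Toy data: two properties per deposit, x0 is a name (1.1.N).
deposits = [
    ((1.0, 1.0), "oil"), ((1.2, 0.9), "oil"), ((0.9, 1.1), "oil"),
    ((5.0, 5.0), "dry"), ((5.1, 4.8), "dry"),
]
print(predict_name(deposits, (1.1, 1.0)))  # oil

# Toy data: one property per deposit, x0 is a quantity (1.1.K).
reserves = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
print(predict_quantity(reserves, (2.0,)))  # 20.0
```

The only difference between the two types in this sketch is the final
aggregation step: a vote for names, an average for quantities.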
Problems of class 1.2 are similar in meaning, but here it is necessary to make
a decision about several elements at once: to recognise q objects (type
1.2.N), to determine the order positions of a group of objects (type 1.2.P) or
to estimate the quantitative characteristic x0 for q objects at once (type
1.2.K).
Problems of class 1.3 are of significant importance. To separate the objects
according to the similarity of their properties, i.e. to build some
classification, means to form a new column x0 measured in the scale of names
(a problem of type 1.3.N). It is often called the problem of automatic
classification, or taxonomy. When m objects are evaluated by n experts, it is
necessary to determine the summary estimate either in the scale of order (then
it is a 1.3.P problem) or in a stronger scale, e.g. in percent (a problem of
type 1.3.K).
The problems of the second family arise when it is necessary, for example, to
estimate the informativeness of the properties represented in the table. If
the existing properties have been separated beforehand into
«informative» and «non-informative» classes, then
determining the place of some new property among these groups is a problem of
type 2.1.N. If it is required to indicate the place of a new property in a
previously ordered set of properties, then the problem 2.1.P is solved. And if
one needs to estimate the informativeness of property b in bytes, then the
problem 2.1.K appears. For a group of properties, the problems 2.2.N, 2.2.P
and 2.2.K are formulated in the same way.
The interpretation of the problems of estimating the whole aggregate of
properties at once (problems of types 2.3.N, 2.3.P and 2.3.K) is evident.
Imagine a table with empty cells in different columns and rows. To predict
the values of the missing elements one has to solve problems of different
types from class 3.2, including the problem of predicting elements of
different types, 3.2.R.
Finally, class 3.3 covers the problems of generating a table with given
properties: test tables for checking pattern recognition programs, tables of
random numbers, etc. Depending on the required scale type, the problems 3.3.N,
3.3.P, 3.3.K or 3.3.R will arise.
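As an illustration of class 3.3, the sketch below generates a small test table
that could be used to check a pattern recognition program. Everything in it is
an assumption made for the example: the function name make_test_table, the
Gaussian clusters centred at multiples of 5, and the fixed seed that makes the
table reproducible. Each row holds n quantitative properties (scale K) and a
class label (scale N).

```python
import random


def make_test_table(m, n, n_classes=2, spread=1.0, seed=0):
    """Generate m objects with n quantitative properties and a class label.

    Objects of class k are drawn from a Gaussian centred at 5*k in every
    property; labels are assigned in turn so the classes are balanced.
    """
    rng = random.Random(seed)  # a private generator keeps runs reproducible
    table = []
    for i in range(m):
        label = i % n_classes
        row = [rng.gauss(5.0 * label, spread) for _ in range(n)]
        table.append((row, label))
    return table


table = make_test_table(m=6, n=3)
for row, label in table:
    print(label, [round(v, 2) for v in row])
```

Calling the function twice with the same seed yields the same table, which is
convenient when the table serves as a fixed benchmark.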
Not all of the described types of problems are equally well studied. Some of
them have a long history, are well known, and have well-developed algorithms
and programs for their solution, which are used in different applied areas.
Others are less known, but are well understood and are sometimes used. There
are also some that have not yet been formulated clearly and whose
interpretation is still difficult.
This software implements methods for solving problems of the following types:
taxonomy (1.3.N), selection of a system of informative properties (2.3.N),
pattern recognition (1.1.N, 1.2.N), filling of gaps (3.2.N, 3.2.P, 3.2.K),
prediction of dynamic objects (1.1.K).
REFERENCES
1. Zagoruiko N.G., Elkina V.N., Lbov G.S., Emelianov S.V. Package of Applied
Programs OTEKS. "Finance and Statistics". M., 1968