Data types

The data type problem

The design of a particular data mining system implies selecting the set of data types that the system supports. In Object-Oriented Programming (OOP), this is part of the software design: data types are declared during software development. If the data types of a particular learning problem fall outside the range supported by the data mining system, users have two options: redesign the system, or distort the types of the input training data to fit it. The first option is often not available at all, and the second produces an inappropriate result.

There are two solutions to this problem. The first is to develop data type conversion mechanisms that work correctly within a data mining (DM) tool with a limited number of data types. For example, if input data are of a cyclical data type [Krantz et al, 1970, 1981, 1990; Kovalerchuk, 1973] and the DM tool supports only linear data types, one may develop a coding of the cyclical data such that a linear mechanism processes the data correctly.
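One possible coding of this kind can be sketched as follows. The sketch assumes that the cyclical attribute is an hour of the day with period 24 (an illustrative choice, not taken from the text) and maps each value to a point on the unit circle. The helper names `encode_cyclic` and `chord_distance` are hypothetical.

```python
import math

def encode_cyclic(value, period):
    """Map a cyclical value to a point (cos, sin) on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return (math.cos(angle), math.sin(angle))

def chord_distance(p, q):
    """Ordinary Euclidean distance between two encoded points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

PERIOD = 24  # assumed period: hours of a day

h23 = encode_cyclic(23, PERIOD)
h1 = encode_cyclic(1, PERIOD)
h12 = encode_cyclic(12, PERIOD)

# On the raw linear scale 23 and 1 look far apart (|23 - 1| = 22).
# After encoding, 23 is closer to 1 than it is to 12.
near = chord_distance(h23, h1)
far = chord_distance(h23, h12)
```

After this coding, a tool that supports only linear operations (such as Euclidean distance) treats values on either side of the wrap-around point as neighbors.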

The second solution is to develop universal DM tools that can handle any data type. In this case, the member functions of a data type must be supplied as input along with the data (MMDR implements this approach). This problem is more general than that of any specific DM tool.
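The idea of supplying a data type's operations together with the data can be sketched as follows. This is only an illustration and does not reproduce the interface of MMDR. The names `DataType` and `true_facts` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Predicate = Callable[[Any, Any], bool]

@dataclass
class DataType:
    """A data type described by the predicates that are interpretable for it."""
    name: str
    predicates: Dict[str, Predicate] = field(default_factory=dict)

def true_facts(values: List[Any], dtype: DataType) -> List[Tuple[str, Any, Any]]:
    """List every (predicate, x, y) fact that holds, using only supplied predicates."""
    facts = []
    for pname, pred in dtype.predicates.items():
        for i, x in enumerate(values):
            for j, y in enumerate(values):
                if i != j and pred(x, y):
                    facts.append((pname, x, y))
    return facts

# A nominal type supplies equality only; an ordered type also supplies "less".
nominal = DataType("nominal", {"equal": lambda x, y: x == y})
ordered = DataType("ordered", {"equal": lambda x, y: x == y,
                               "less": lambda x, y: x < y})

codes = [3, 1, 3]
nominal_facts = true_facts(codes, nominal)   # equality facts only
ordered_facts = true_facts(codes, ordered)   # equality and order facts
```

Because the generic routine uses only the predicates it receives, the same numeric codes yield no order facts when they are declared nominal.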

Example of difficulties

Difficulties arise from uncertainty about the set of interpretable operations and predicates for these data, i.e., uncertainty about the empirical content of the data. A precise definition of the empirical content of data will be given below in terms of empirical axiomatic theory.

Numerical data type

In previous sections, the term numerical data was used without a formal definition. In fact, there are several different numerical data types. The strongest is the absolute data type (absolute scale). With this data type in the background knowledge B of a learning problem, we can use most of the known numerical relations and operations for discovering a rule. The weakest numerical data type is the nominal data type (nominal scale). It is the same as the equivalence data type defined in Section 4.9.1 and allows only one interpretable relation, Q(x,y): it is known only whether x and y are equal. In between lies a spectrum of data types that allow one to compare values with ordering relations, to add, multiply and divide values, and so on. Stevens [1946] suggested a classification of these data types, presented in Table 4.26. The basis of this classification is a transformation group. The strongest, absolute data type does not permit any transformation of the data, while the weakest, nominal data type permits any one-to-one transformation, that is, any one-to-one coding of data entities by numbers. Intermediate data types permit different transformations, such as positive monotone, linear and others [Krantz, et al, 1990].
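The role of permissible transformations can be illustrated with a small sketch. The values and the two transformations below are made up for illustration. The point is that an arbitrary one-to-one recoding keeps the equality relation of a nominal type but destroys order, while a positive monotone transformation keeps order.

```python
def preserves_equality(values, transform):
    """True if x == y exactly when transform(x) == transform(y) for all pairs."""
    return all((x == y) == (transform(x) == transform(y))
               for x in values for y in values)

def preserves_order(values, transform):
    """True if x < y exactly when transform(x) < transform(y) for all pairs."""
    return all((x < y) == (transform(x) < transform(y))
               for x in values for y in values)

def cube(x):
    """A positive monotone transformation."""
    return x ** 3

values = [1, 2, 3, 2]
recode = {1: 30, 2: 10, 3: 20}   # arbitrary one-to-one recoding

recode_keeps_equality = preserves_equality(values, recode.get)  # equality survives
recode_keeps_order = preserves_order(values, recode.get)        # order does not
cube_keeps_order = preserves_order(values, cube)                # order survives
```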

Numerical data types (Stevens’ classification of scale types)

Relational data types

A data type is relational if it is described in terms of the set of relations (predicates). Some basic relational data types include:

Equivalence data type

Tolerance data type

Partial ordering data type

Weak ordering data type

Strong ordering data type

Semi-ordering data type

Interval ordering data type

Tree data type

Next, we define these data types.

The function U is called a numeric representation of A. A numeric representation does not necessarily exist for every data type. For instance, a partial ordering relation with incomparable elements does not have a numeric representation. At first glance this seems strange, because we can always code the elements of A with numbers. However, there is no numerical coding consistent with such a partial ordering relation P. Consider elements a and b that are not comparable, i.e., P(a,b)=0 and P(b,a)=0, but are coded by numbers. Any two numbers are comparable, e.g., 4<5. Therefore, numerical coding brings a non-interpretable property to A. If numerical comparison is not used in constructing a learned rule, any one-to-one numerical coding can be used. If the numerical order is used in learning a rule, the rule may be non-interpretable.
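The argument above can be checked by brute force on a small example. The sketch assumes that a coding U is consistent with a relation P when P(a,b) holds exactly when U(a) <= U(b). This formulation is an assumption made for the example. The divisibility order on {2, 3, 6} is also an illustrative choice.

```python
from itertools import product

def has_numeric_representation(elements, relation):
    """Search for U with relation(a, b) == (U[a] <= U[b]) for all pairs.

    Only the relative order of the codes matters, so trying codes from
    range(len(elements)) covers every possible numeric coding.
    """
    n = len(elements)
    for codes in product(range(n), repeat=n):
        U = dict(zip(elements, codes))
        if all(relation(a, b) == (U[a] <= U[b])
               for a in elements for b in elements):
            return True
    return False

def divides(a, b):
    """Partial order: a precedes b when a divides b."""
    return b % a == 0

def leq(a, b):
    """Ordinary linear order on integers."""
    return a <= b

elements = [2, 3, 6]

# 2 and 3 are incomparable under divisibility, so no coding can work:
# any two numbers satisfy U[a] <= U[b] or U[b] <= U[a].
partial_ok = has_numeric_representation(elements, divides)
linear_ok = has_numeric_representation(elements, leq)
```

The search fails for the divisibility order and succeeds for the ordinary linear order on the same elements.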

If an ordering relation is interpretable in terms of domain background knowledge, it can be included in the background knowledge and used directly for discovering rules in RDM. Partial ordering and tree-type relations often appear in hierarchical classifications and in decision trees. In financial applications, data are usually presented as numeric attributes, but the relations are often not presented explicitly. More precisely, these attributes are coded with numbers, but the applicability of numerical relations and operations must be confirmed.

Financial illustration of order relations

What is the traditional way of processing binary relations? Traditionally, numerical methods use distance functions (from a metric space) between matrices of binary relations. These distance functions are defined axiomatically or by using statistical assumptions associated with the coefficients of Stuart, Yule and Kendall, information gain measures, etc. Obviously, these assumptions restrict the areas of applicability of these methods.
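A distance between relation matrices can be sketched as follows. Hamming distance is used here only as one simple example of such a metric, not as the specific measure used by the methods mentioned above. The stock labels and the price and volume values are made up.

```python
def relation_matrix(elements, relation):
    """0/1 matrix with entry [i][j] = 1 when relation(elements[i], elements[j]) holds."""
    return [[1 if relation(a, b) else 0 for b in elements] for a in elements]

def hamming_distance(m1, m2):
    """Number of element pairs on which the two relations disagree."""
    return sum(x != y
               for row1, row2 in zip(m1, m2)
               for x, y in zip(row1, row2))

stocks = ["A", "B", "C"]                   # hypothetical labels
price = {"A": 10, "B": 20, "C": 30}        # made-up values for illustration
volume = {"A": 300, "B": 100, "C": 200}

# Two strict order relations on the same set of stocks.
by_price = relation_matrix(stocks, lambda a, b: price[a] < price[b])
by_volume = relation_matrix(stocks, lambda a, b: volume[a] < volume[b])

d = hamming_distance(by_price, by_volume)
```

Here the two orderings agree only on the pair (B, C), so they disagree on four of the six ordered pairs of distinct stocks.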

Critical analysis of data types in ABL

This section examines the empirical status of data types for different learning tasks and their treatment by known numerical data mining methods such as regression, correlation, covariance analysis, and analysis of variance. In the previous section, it was required that a relational system A representing a data type be interpreted as an empirical real-world system, i.e., A should be included in the domain background knowledge of a learning task. Numerical methods such as regression, correlation, covariance analysis and analysis of variance assume that any standard numerical operation (+, -, *, / and so on) can be used, regardless of whether it is interpretable. As a result, a non-interpretable learned rule can be obtained. Let us consider this situation in more detail for six different cases.
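A small numerical sketch shows how a non-interpretable operation can mislead. It assumes an attribute for which only the order of values is interpretable, so any positive monotone recoding is permissible. The two groups and the recoding are made up for illustration.

```python
from statistics import mean

group_a = [1, 6]
group_b = [3, 3]

# A positive monotone recoding: it keeps the order 1 < 3 < 6 intact.
recode = {1: 1, 3: 3, 6: 4}

a_before, b_before = mean(group_a), mean(group_b)
a_after = mean(recode[x] for x in group_a)
b_after = mean(recode[x] for x in group_b)

# The comparison of group means depends on the coding, not on the data.
a_greater_before = a_before > b_before
a_greater_after = a_after > b_after
```

The recoding preserves everything that is interpretable for such an attribute, yet the comparison of means reverses. A rule that relies on addition or averaging therefore has no stable empirical meaning for this data type.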

Physical data types in physical problems.

Physical data types for non-physical problems.

Non-physical data types for non-physical tasks.

Nominal discrete data types.

Non-quantitative and non-discrete data types.

Mix of data types.