Overview
Transformer is a pre-processing module that turns numerical data into binary data.
Synopsis
./pre01_transformer.sh <database> -c=[<integer>,<integer>,<integer>|auto] -m=<method_name> [options depending on the method]
<database> name of a tab-separated data file. See below for an example.
--help : Print this message (-h).
--list : List all available methods (-l).
--version : Print version of transformer (-v).
-a : Additive option for some methods.
-c VAL : Three comma separated integers giving indexes of the
row header, first and last numerical columns to consider.
VAL may be set to AUTO: indexes are automatically chosen.
-cart : Cartesian option for some methods
-center : Center dataset: set mean to zero.
-i VAL : List of comma separated intervals.
An interval has the form NAME:MIN:MAX:[i|e]:[i|e]
-m VAL : Method to use. Use --list to get available methods.
-nbi N : A number of intervals used by some methods.
-nbsi N : A number of sub-intervals used by some methods.
-of VAL : Output file.
-p N : A percentage used by some methods.
-r : Revert option for some methods.
-reduce : Reduce dataset: set std dev to one.
-ref N : A referent column. For each line, values are divided by the referent value.
Finally, the referent column is removed.
-t VAL : List of comma separated thresholds.
N.B.: VAL represents a String while N an integer.
Database example
$ more sample/numeric/interordinal.test
id m1 m2 m3
g1 5 7 6
g2 6 8 4
g3 4 8 5
g4 4 9 8
g5 5 8 5
Here, g1, …, g5 are row (object) labels, while m1, …, m3 are column (attribute) labels.
Values are separated by tabs.
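To make the file layout concrete, the following minimal Python sketch reads such a table into object labels, attribute labels and numerical values. It is not part of Transformer: the function name read_table and the variable SAMPLE are hypothetical, and the sketch assumes that the first column holds the row labels and that every other column is numerical.

```python
import csv
import io

# The sample database shown above, with tab-separated values.
SAMPLE = (
    "id\tm1\tm2\tm3\n"
    "g1\t5\t7\t6\n"
    "g2\t6\t8\t4\n"
    "g3\t4\t8\t5\n"
    "g4\t4\t9\t8\n"
    "g5\t5\t8\t5\n"
)


def read_table(text):
    """Split a tab-separated table into row labels, column labels and values."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    attributes = header[1:]               # m1, m2, m3
    objects = [line[0] for line in body]  # g1 ... g5
    values = [[float(v) for v in line[1:]] for line in body]
    return objects, attributes, values


objects, attributes, values = read_table(SAMPLE)
print(objects)
print(attributes)
print(values[0])
```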
Method illustrations
Interval Coding
Each attribute domain is split into intervals given with the option -i.
An interval has the form NAME:MIN:MAX:[i|e]:[i|e], where the last two fields state whether MIN and MAX are included (i) or excluded (e).
This method has been used with Coron in [6].
To cut attribute values into [0,5] and ]5,9], we call
$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m interval -i A:0:5:i:i,B:5:9:e:i
it returns
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1_A | m1_B | m2_A | m2_B | m3_A | m3_B
100101
010110
100110
100101
100110
[END Relational Context]
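The coding above can be sketched in a few lines of Python. This is an illustration rather than Transformer's own code: the function names are hypothetical, and the reading of i and e as inclusive and exclusive bounds is inferred from the example, where A:0:5:i:i stands for [0,5] and B:5:9:e:i for ]5,9].

```python
# Values of m1, m2, m3 for g1 ... g5 in the sample database.
ROWS = [[5, 7, 6], [6, 8, 4], [4, 8, 5], [4, 9, 8], [5, 8, 5]]


def parse_interval(spec):
    """Parse NAME:MIN:MAX:[i|e]:[i|e] into a tuple."""
    name, low, high, low_kind, high_kind = spec.split(":")
    return name, float(low), float(high), low_kind == "i", high_kind == "i"


def in_interval(value, low, high, low_inclusive, high_inclusive):
    """Test membership, honouring inclusive (i) or exclusive (e) bounds."""
    above = value >= low if low_inclusive else value > low
    below = value <= high if high_inclusive else value < high
    return above and below


def interval_coding(rows, specs):
    """Emit one bit per (attribute, interval) pair, attribute by attribute."""
    intervals = [parse_interval(s) for s in specs.split(",")]
    coded = []
    for row in rows:
        bits = ""
        for value in row:
            for _name, low, high, low_inc, high_inc in intervals:
                bits += "1" if in_interval(value, low, high, low_inc, high_inc) else "0"
        coded.append(bits)
    return coded


print(interval_coding(ROWS, "A:0:5:i:i,B:5:9:e:i"))
# ['100101', '010110', '100110', '100101', '100110']
```

The printed rows match the binary relation returned by the tool for the same intervals.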
Crossed Interval Coding
This method is similar to Interval Coding. The difference is that intervals are considered pairwise, in the order in which the user gives them.
Called with
$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m crossed -i A:0:5:i:i,B:5:9:e:i
it returns
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1_A/B | m2_A/B | m3_A/B
111
111
111
111
111
[END Relational Context]
Slope Coding
The original numerical table is transformed into a binary table that captures the trend of the attribute values between each pair of consecutive attributes. This trend can be rising, falling, or one that is considered to have no significant change. See [3] for more details.
Trends are characterized by a comma-separated list of thresholds given with -t (the original publication allows a single threshold).
Called with
$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m slope -t 60
it returns
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1/m2-t1 | m1/m2+t1 | m2/m3-t1 | m2/m3+t1
0100
0110
0110
0100
0110
[END Relational Context]
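The text does not state how a threshold is compared with the data. One reading that reproduces the output above is sketched here in Python, as an assumption only: the threshold is taken as an angle in degrees and compared, with strict inequalities, against the arctangent of the difference between two consecutive values. The function name slope_coding is hypothetical, and a single threshold is handled.

```python
import math

# Values of m1, m2, m3 for g1 ... g5 in the sample database.
ROWS = [[5, 7, 6], [6, 8, 4], [4, 8, 5], [4, 9, 8], [5, 8, 5]]


def slope_coding(rows, threshold_deg):
    """For each pair of consecutive attributes, emit a falling bit then a rising bit.

    Assumption: the slope angle is atan(next - current), expressed in degrees.
    """
    coded = []
    for row in rows:
        bits = ""
        for current, following in zip(row, row[1:]):
            angle = math.degrees(math.atan(following - current))
            bits += "1" if angle < -threshold_deg else "0"  # e.g. m1/m2-t1
            bits += "1" if angle > threshold_deg else "0"   # e.g. m1/m2+t1
        coded.append(bits)
    return coded


print(slope_coding(ROWS, 60))
# ['0100', '0110', '0110', '0100', '0110']
```

For g1, the step from 5 to 7 gives an angle of about 63 degrees, which exceeds 60 and sets m1/m2+t1, while the step from 7 to 6 gives -45 degrees and sets neither bit.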
Mid-Ranged
The highest and lowest values are identified for each row, and the mid-range value (the mean of these two values) is computed. For a given row, every value strictly above the mid-range value gives rise to value 1, and to 0 otherwise. See [2] for more details.
Called with
$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m midranged
it returns
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1 | m2 | m3
010
010
010
011
010
[END Relational Context]
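As an illustration, the rule can be written as the short Python sketch below. The function name midranged is hypothetical, and the mid-range value is taken as the mean of the row minimum and maximum.

```python
# Values of m1, m2, m3 for g1 ... g5 in the sample database.
ROWS = [[5, 7, 6], [6, 8, 4], [4, 8, 5], [4, 9, 8], [5, 8, 5]]


def midranged(rows):
    """Set a bit to 1 when the value is strictly above the row mid-range."""
    coded = []
    for row in rows:
        mid = (max(row) + min(row)) / 2
        coded.append("".join("1" if value > mid else "0" for value in row))
    return coded


print(midranged(ROWS))
# ['010', '010', '010', '011', '010']
```

For g1 the mid-range is 6, so m3 = 6 is not strictly above it and is coded 0.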
X% Max
For each row, we consider the attributes whose value lies within X% of the highest values. These attributes are assigned value 1, and the others 0. See [2] for more details. The percentage is given with the -p VAL option, where 0<VAL<=100.
For 75% Max, it is called with
$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m percentmax -p 75
and returns
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1 | m2 | m3
011
010
010
011
010
[END Relational Context]
Max - X% Max
The cutoff is fixed w.r.t. the maximal value observed for each row: a percentage X of this value is subtracted from it. All values greater than (100 - X)% of the Max value give rise to value 1, and to 0 otherwise. See [2] for more details. The percentage is given with the -p VAL option, where 0<VAL<=100.
For Max - 25% Max, i.e. a cutoff at 75% of the row maximum, it is called with
$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m diffpercentmax -p 25
it returns
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1 | m2 | m3
011
010
010
011
010
[END Relational Context]
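The cutoff computation can be sketched as follows in Python. The function name diff_percent_max is hypothetical, and the strict comparison is an assumption that follows the wording "greater than" and reproduces the output above.

```python
# Values of m1, m2, m3 for g1 ... g5 in the sample database.
ROWS = [[5, 7, 6], [6, 8, 4], [4, 8, 5], [4, 9, 8], [5, 8, 5]]


def diff_percent_max(rows, percent):
    """Set a bit to 1 when the value exceeds (100 - percent)% of the row maximum."""
    coded = []
    for row in rows:
        cutoff = max(row) * (100 - percent) / 100
        coded.append("".join("1" if value > cutoff else "0" for value in row))
    return coded


print(diff_percent_max(ROWS, 25))
# ['011', '010', '010', '011', '010']
```

For g2 the cutoff is 6, so m1 = 6 is not strictly greater and is coded 0. With -p 25 the cutoff sits at 75% of the row maximum, and the rows printed here are also those shown for the 75% Max example.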
Motameny Coding
For each row, a cutting threshold is defined as the middle of the largest gap of values. This interval may be enlarged with a "noise" percentage given by option -p VAL where 0<=VAL<=100. See [5] for more details.
With a 1% noise parameter, it is called with
$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m motameny -p 1
it returns
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1<=tg | m1>tg | m2<=tg | m2>tg | m3<=tg | m3>tg
100110
010110
100110
100101
100110
[END Relational Context]
Choi et al. Coding
This method is defined in [4]. It searches for the nbi-1 largest gaps to build nbi intervals.
The largest interval is then cut into nbsi sub-intervals w.r.t. its own nbsi-1 largest gaps.
-nbi Number of intervals.
-nbsi Number of sub-intervals of the largest interval.
Called with
$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m subinterval -nbi 1 -nbsi 2
it returns
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1_a1 | m1_a2 | m2_a1 | m2_a2 | m3_a1 | m3_a2
111111
011110
111111
110111
111111
[END Relational Context]
X-fold Coding
This method characterizes variations of object values from an attribute of index i to the attribute of index i+1.
The option -t VAL gives a set of thresholds characterizing these variations.
On our example, -t 1,2 produces the following binary attributes
m1/m2_x1.0 | m1/m2_x2.0 | m2/m3_x1.0 | m2/m3_x2.0
With the option -cart, every pair of attributes is considered instead of consecutive attributes only.
For example, calling
$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m xfold -t 1,2
returns
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1/m2_x1.0 | m1/m2_x2.0 | m2/m3_x1.0 | m2/m3_x2.0
1010
1010
0110
0110
0110
[END Relational Context]
If the option -cart is given, the generated attributes are
m1/m2_x1.0 | m1/m2_x2.0 | m1/m3_x1.0 | m1/m3_x2.0 | m2/m3_x1.0 | m2/m3_x2.0
Nominal scaling
This method is defined in [1] and does not need any particular option.
First it searches for distinct values for each attribute.
Then it uses these values to build binary attributes as shown in the example.
Called with
$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m nominal
it returns
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1=4.0 | m1=5.0 | m1=6.0 | m2=7.0 | m2=8.0 | m2=9.0 | m3=4.0 | m3=5.0 | m3=6.0 | m3=8.0 |
0101000010
0010101000
1000100100
1000010001
0100100100
[END Relational Context]
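A minimal Python sketch of this scaling is given below for illustration. The function name nominal_scaling is hypothetical, and the sketch assumes that distinct values are listed in increasing order, as in the output above.

```python
# Values of m1, m2, m3 for g1 ... g5 in the sample database.
ROWS = [[5, 7, 6], [6, 8, 4], [4, 8, 5], [4, 9, 8], [5, 8, 5]]
NAMES = ["m1", "m2", "m3"]


def nominal_scaling(rows, names):
    """Build one binary attribute 'name=value' per distinct value of each column."""
    distinct_per_column = [sorted(set(column)) for column in zip(*rows)]
    header = [
        f"{name}={float(value)}"
        for name, distinct in zip(names, distinct_per_column)
        for value in distinct
    ]
    coded = []
    for row in rows:
        bits = ""
        for value, distinct in zip(row, distinct_per_column):
            bits += "".join("1" if value == d else "0" for d in distinct)
        coded.append(bits)
    return header, coded


header, coded = nominal_scaling(ROWS, NAMES)
print(" | ".join(header))
print(coded)
# ['0101000010', '0010101000', '1000100100', '1000010001', '0100100100']
```

Each attribute contributes exactly one bit set to 1 per row, which is why every row above contains three 1s.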
Ordinal scaling
This method is defined in [1] and does not need any particular option.
First it searches for distinct values for each attribute.
Then it uses these values to build binary attributes as shown in the example.
Called with
$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m ordinal
it returns
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1>=4.0 | m1>=5.0 | m1>=6.0 | m2>=7.0 | m2>=8.0 | m2>=9.0 | m3>=4.0 | m3>=5.0 | m3>=6.0 | m3>=8.0 |
1101001110
1111101000
1001101100
1001111111
1101101100
[END Relational Context]
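For illustration, a compact Python sketch of the same idea follows. The function name ordinal_scaling is hypothetical; each bit tests whether the row value is greater than or equal to one of the distinct values of its column, taken in increasing order.

```python
# Values of m1, m2, m3 for g1 ... g5 in the sample database.
ROWS = [[5, 7, 6], [6, 8, 4], [4, 8, 5], [4, 9, 8], [5, 8, 5]]


def ordinal_scaling(rows):
    """Emit one 'value >= d' bit per distinct value d of each column."""
    distinct_per_column = [sorted(set(column)) for column in zip(*rows)]
    return [
        "".join(
            "1" if value >= d else "0"
            for value, distinct in zip(row, distinct_per_column)
            for d in distinct
        )
        for row in rows
    ]


print(ordinal_scaling(ROWS))
# ['1101001110', '1111101000', '1001101100', '1001111111', '1101101100']
```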
Interordinal scaling
This method is defined in [1] and does not need any particular option.
First it searches for distinct values for each attribute.
Then it uses these values to build binary attributes as shown in the example.
This method has been used with Coron in [7].
Called with
$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m interordinal
it returns
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1<=4.0 | m1<=5.0 | m1<=6.0 | m1>=4.0 | m1>=5.0 | m1>=6.0 |
m2<=7.0 | m2<=8.0 | m2<=9.0 | m2>=7.0 | m2>=8.0 | m2>=9.0 |
m3<=4.0 | m3<=5.0 | m3<=6.0 | m3<=8.0 | m3>=4.0 | m3>=5.0 | m3>=6.0 | m3>=8.0 |
01111011110000111110
00111101111011111000
11110001111001111100
11110000111100011111
01111001111001111100
[END Relational Context]
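The attribute order of this output can be reproduced with the Python sketch below, given for illustration only. The function name interordinal_scaling is hypothetical; for each column it emits all the "<=" bits first and then all the ">=" bits, following the header shown above.

```python
# Values of m1, m2, m3 for g1 ... g5 in the sample database.
ROWS = [[5, 7, 6], [6, 8, 4], [4, 8, 5], [4, 9, 8], [5, 8, 5]]


def interordinal_scaling(rows):
    """Per column: 'value <= d' bits for every distinct d, then 'value >= d' bits."""
    distinct_per_column = [sorted(set(column)) for column in zip(*rows)]
    coded = []
    for row in rows:
        bits = ""
        for value, distinct in zip(row, distinct_per_column):
            bits += "".join("1" if value <= d else "0" for d in distinct)
            bits += "".join("1" if value >= d else "0" for d in distinct)
        coded.append(bits)
    return coded


for line in interordinal_scaling(ROWS):
    print(line)
# 01111011110000111110
# 00111101111011111000
# 11110001111001111100
# 11110000111100011111
# 01111001111001111100
```

Each row has 20 bits: 6 for m1, 6 for m2 and 8 for m3, since m3 takes four distinct values.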





