Transformer

Overview

Transformer is a pre-processing module that turns numerical data into binary data.

Synopsis

./pre01_transformer.sh <database>  -c=[<integer>,<integer>,<integer>|auto]  -m=<method_name> [options depending on the method]

<database>    name of a tab-separated data file. See below for an example. 
 --help    : Print this message (-h).
 --list    : List all available methods (-l).
 --version : Print version of transformer (-v).
 -a        : Additive option for some methods.
 -c VAL    : Three comma separated integers giving indexes of the
             row header, first and last numerical columns to consider.
             VAL may be set to AUTO: indexes are automatically chosen.
 -cart     : Cartesian option for some methods
 -center   : Center dataset: set mean to zero.
 -i VAL    : List of comma separated intervals.
             An interval has the form  NAME:MIN:MAX:[i|e]:[i|e]
 -m VAL    : Method to use. Use --list to get available methods.
 -nbi N    : A number of intervals used by some methods.
 -nbsi N   : A number of sub-intervals used by some methods.
 -of VAL   : Output file.
 -p N      : A percentage used by some methods.
 -r        : Revert option for some methods.
 -reduce   : Reduce dataset: set std dev to one.
 -ref N    : A referent column. For each line, values are divided by the referent value.
             Finally, the referent column is removed.
 -t VAL    : List of comma separated thresholds.

N.B.: VAL denotes a string and N an integer.

Database example

$ more sample/numeric/interordinal.test
id      m1      m2      m3
g1      5       7       6
g2      6       8       4
g3      4       8       5
g4      4       9       8
g5      5       8       5

Here, g1, …, g5 are row (object) labels, while m1, …, m3 are column (attribute) labels.
Values are separated by tab characters.
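
As an illustration of this layout, here is a minimal Python sketch that reads such a tab-separated table. It is not part of Transformer: the function name load_table is hypothetical, and the sketch assumes the first column holds the row labels and all remaining columns are numerical.

```python
import csv
import io

# The sample table, with tab-separated values (same content as interordinal.test).
SAMPLE = "\n".join([
    "id\tm1\tm2\tm3",
    "g1\t5\t7\t6",
    "g2\t6\t8\t4",
    "g3\t4\t8\t5",
    "g4\t4\t9\t8",
    "g5\t5\t8\t5",
])

def load_table(handle):
    """Return (attribute labels, {row label: list of numerical values})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    attributes = header[1:]  # assumption: column 0 is the row header
    rows = {rec[0]: [float(x) for x in rec[1:]] for rec in reader if rec}
    return attributes, rows

attributes, rows = load_table(io.StringIO(SAMPLE))
print(attributes)
print(rows["g4"])
```

To read a real file, an open file handle can be passed instead of the io.StringIO object.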

Method illustrations

Interval Coding

Each attribute domain is split into the intervals given through the option -i.
An interval has the form NAME:MIN:MAX:[i|e]:[i|e], where the last two fields state whether the MIN and MAX bounds are included (i) or excluded (e).
This method has been used with Coron in [6].
To cut attribute values into [0,5] and ]5,9], we call

$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m interval -i A:0:5:i:i,B:5:9:e:i

it returns

[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1_A | m1_B | m2_A | m2_B | m3_A | m3_B
100101
010110
100110
100101
100110
[END Relational Context]
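
The coding above can be reproduced with a short Python sketch. This is an illustration only, not Transformer's implementation: the function names are hypothetical, and the sketch assumes that i marks an included bound and e an excluded one, as the example intervals [0,5] and ]5,9] suggest.

```python
# Sample table: row label -> values of m1, m2, m3.
ROWS = {
    "g1": [5, 7, 6],
    "g2": [6, 8, 4],
    "g3": [4, 8, 5],
    "g4": [4, 9, 8],
    "g5": [5, 8, 5],
}

def parse_interval(spec):
    """Parse 'NAME:MIN:MAX:[i|e]:[i|e]' into (name, low, high, low_incl, high_incl)."""
    name, low, high, low_flag, high_flag = spec.split(":")
    return name, float(low), float(high), low_flag == "i", high_flag == "i"

def in_interval(value, interval):
    """Test membership, honouring included (i) and excluded (e) bounds."""
    _, low, high, low_incl, high_incl = interval
    above = (value >= low) if low_incl else (value > low)
    below = (value <= high) if high_incl else (value < high)
    return above and below

def interval_coding(rows, specs):
    """One binary attribute per (column, interval) pair, in column order."""
    intervals = [parse_interval(s) for s in specs.split(",")]
    return {
        label: "".join(
            "1" if in_interval(v, itv) else "0"
            for v in values
            for itv in intervals
        )
        for label, values in rows.items()
    }

coded = interval_coding(ROWS, "A:0:5:i:i,B:5:9:e:i")
for label, bits in coded.items():
    print(label, bits)  # g1 100101, g2 010110, g3 100110, g4 100101, g5 100110
```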

Crossed Interval Coding

This method is similar to Interval Coding. The difference is that intervals are considered pairwise, in the order in which the user gave them.
Called with

$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m crossed -i A:0:5:i:i,B:5:9:e:i

it returns

[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1_A/B | m2_A/B | m3_A/B
111
111
111
111
111
[END Relational Context]

Slope Coding

The original numerical table is transformed into a binary table that captures the trend of the attribute values between each pair of consecutive attributes. This trend can be rising, falling, or without significant change. See [3] for more details.

The trends are characterized by a list of comma separated thresholds given with -t (the original publication allows only one threshold).

$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m slope -t 60

it returns

[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1/m2-t1 | m1/m2+t1 | m2/m3-t1 | m2/m3+t1
0100
0110
0110
0100
0110
[END Relational Context]
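
This page does not state the unit of the threshold. The Python sketch below assumes it is a slope angle in degrees, computed as the arctangent of the difference between two consecutive attribute values, and that the comparison is strict. Under these assumptions it reproduces the output shown above for -t 60, but it should be read as one possible interpretation rather than as Transformer's actual computation. The function name is hypothetical.

```python
import math

# Sample table: row label -> values of m1, m2, m3.
ROWS = {
    "g1": [5, 7, 6],
    "g2": [6, 8, 4],
    "g3": [4, 8, 5],
    "g4": [4, 9, 8],
    "g5": [5, 8, 5],
}

def slope_coding(rows, threshold_deg):
    """For each pair of consecutive attributes, emit a 'falling' bit then a 'rising' bit."""
    coded = {}
    for label, values in rows.items():
        bits = []
        for left, right in zip(values, values[1:]):
            # Assumption: attributes are one unit apart, so the slope is right - left.
            angle = math.degrees(math.atan(right - left))
            bits.append("1" if angle < -threshold_deg else "0")  # column m_i/m_i+1 -t1
            bits.append("1" if angle > threshold_deg else "0")   # column m_i/m_i+1 +t1
        coded[label] = "".join(bits)
    return coded

coded = slope_coding(ROWS, 60)
for label, bits in coded.items():
    print(label, bits)  # g1 0100, g2 0110, g3 0110, g4 0100, g5 0110
```

A pair with both bits at 0 corresponds to a trend with no significant change.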

Mid-Ranged

The highest and lowest values are identified for each row, and the mid-range value is defined as the average of the two. For a given row, all values that are strictly above the mid-range value give rise to value 1, 0 otherwise. See [2] for more details.

Called with

$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m midranged

it returns

[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1 | m2 | m3
010
010
010
011
010
[END Relational Context]
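
A minimal Python sketch of this rule, given as an illustration and not as Transformer's code (the function name is hypothetical). It takes the mid-range of a row to be the average of its lowest and highest values.

```python
# Sample table: row label -> values of m1, m2, m3.
ROWS = {
    "g1": [5, 7, 6],
    "g2": [6, 8, 4],
    "g3": [4, 8, 5],
    "g4": [4, 9, 8],
    "g5": [5, 8, 5],
}

def midranged(rows):
    """Set a bit to 1 when the value is strictly above the row's mid-range."""
    coded = {}
    for label, values in rows.items():
        mid = (max(values) + min(values)) / 2
        coded[label] = "".join("1" if v > mid else "0" for v in values)
    return coded

coded = midranged(ROWS)
for label, bits in coded.items():
    print(label, bits)  # g1 010, g2 010, g3 010, g4 011, g5 010
```

For g1, the mid-range is (5 + 7) / 2 = 6, so the value 6 of m3 is not strictly above it and is coded 0.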

X% Max

For each row, we consider the attributes whose value is in the X% highest values of that row. These attributes are assigned value 1, and 0 otherwise. See [2] for more details. The percentage is given with the -p N option, where 0 < N <= 100.

For 75% Max, it is called with

$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m percentmax  -p 75

and returns

[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1 | m2 | m3
011
010
010
011
010
[END Relational Context]

Max - X% Max

The cut-off is fixed w.r.t. the maximal value observed for each row: a percentage X of this maximal value is subtracted from it. All values that are greater than (100 - X)% of the Max value give rise to value 1, 0 otherwise. See [2] for more details. The percentage is given with the -p N option, where 0 < N <= 100.

For Max - 25% Max, that is, a cut-off at 75% of each row's maximal value, it is called with

$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m diffpercentmax -p 25

it returns

[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1 | m2 | m3
011
010
010
011
010
[END Relational Context]
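
The cut-off rule can be sketched in Python as follows. This is an illustration under the stated rule, not Transformer's implementation, and the function name is hypothetical. With -p 25, the cut-off is 75% of each row's maximal value, and the comparison is strict.

```python
# Sample table: row label -> values of m1, m2, m3.
ROWS = {
    "g1": [5, 7, 6],
    "g2": [6, 8, 4],
    "g3": [4, 8, 5],
    "g4": [4, 9, 8],
    "g5": [5, 8, 5],
}

def diff_percent_max(rows, percent):
    """Set a bit to 1 when the value exceeds (100 - percent)% of the row maximum."""
    coded = {}
    for label, values in rows.items():
        cutoff = max(values) * (100 - percent) / 100
        coded[label] = "".join("1" if v > cutoff else "0" for v in values)
    return coded

coded = diff_percent_max(ROWS, 25)
for label, bits in coded.items():
    print(label, bits)  # g1 011, g2 010, g3 010, g4 011, g5 010
```

For g2, the cut-off is 0.75 * 8 = 6, so the value 6 of m1 is not strictly greater and is coded 0.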

Motameny Coding

For each row, a cutting threshold is defined as the middle of the largest gap of values. This interval may be enlarged with a "noise" percentage given by option -p VAL where 0<=VAL<=100. See [5] for more details.

With a 1% noise parameter, it is called with

$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m motameny -p 1

it returns

[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1<=tg | m1>tg | m2<=tg | m2>tg | m3<=tg | m3>tg
100110
010110
100110
100101
100110
[END Relational Context]

Choi et al. Coding

This method is defined in [4]. It searches for the nbi-1 largest gaps to build nbi intervals.
The largest interval is cut into nbsi sub-intervals w.r.t. its own nbsi-1 largest gaps.

-nbi Number of intervals.
-nbsi Number of sub-intervals of the largest interval.

Called with

$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m subinterval -nbi 1 -nbsi 2

it returns

[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1_a1 | m1_a2 | m2_a1 | m2_a2 | m3_a1 | m3_a2
111111
011110
111111
110111
111111
[END Relational Context]

X-fold Coding

This method characterizes the variation of object values from the attribute of index i to the attribute of index i+1.
The option -t VAL gives a set of thresholds characterizing variations.
On our example, -t 1,2 produces the following binary attributes

m1/m2_x1.0 | m1/m2_x2.0 | m2/m3_x1.0 | m2/m3_x2.0

With the option -cart, every pair of attributes is considered, not only consecutive ones.

For example, calling

$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m xfold -t 1,2

returns

[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1/m2_x1.0 | m1/m2_x2.0 | m2/m3_x1.0 | m2/m3_x2.0
1010
1010
0110
0110
0110
[END Relational Context]

If the option -cart is given, the generated attributes are

m1/m2_x1.0 | m1/m2_x2.0 | m1/m3_x1.0 | m1/m3_x2.0 | m2/m3_x1.0 | m2/m3_x2.0

Nominal scaling

This method is defined in [1] and does not need any particular option.
First, it searches for the distinct values of each attribute.
Then, for each distinct value v of an attribute m, it builds the binary attribute m=v, as shown in the example.

Called with

$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m nominal

it returns

[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1=4.0 | m1=5.0 | m1=6.0 | m2=7.0 | m2=8.0 | m2=9.0 | m3=4.0 | m3=5.0 | m3=6.0 | m3=8.0 |
0101000010
0010101000
1000100100
1000010001
0100100100
[END Relational Context]
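
A minimal Python sketch of nominal scaling on the sample table, given as an illustration and not as Transformer's code (the function name is hypothetical). Distinct values are listed in increasing order, as in the output above.

```python
# Sample table: row label -> values of m1, m2, m3.
NAMES = ["m1", "m2", "m3"]
ROWS = {
    "g1": [5, 7, 6],
    "g2": [6, 8, 4],
    "g3": [4, 8, 5],
    "g4": [4, 9, 8],
    "g5": [5, 8, 5],
}

def nominal_scaling(rows, names):
    """Build one binary attribute 'm=v' per distinct value v of each column m."""
    columns = list(zip(*rows.values()))
    distinct = [sorted(set(col)) for col in columns]
    attributes = [
        f"{name}={float(d)}" for name, vals in zip(names, distinct) for d in vals
    ]
    coded = {
        label: "".join(
            "1" if v == d else "0"
            for v, vals in zip(values, distinct)
            for d in vals
        )
        for label, values in rows.items()
    }
    return attributes, coded

attributes, coded = nominal_scaling(ROWS, NAMES)
print(" | ".join(attributes))
for label, bits in coded.items():
    print(label, bits)  # g1 0101000010, g2 0010101000, ...
```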

Ordinal scaling

This method is defined in [1] and does not need any particular option.
First, it searches for the distinct values of each attribute.
Then, for each distinct value v of an attribute m, it builds the binary attribute m>=v, as shown in the example.

$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m ordinal

it returns

[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1>=4.0 | m1>=5.0 | m1>=6.0 | m2>=7.0 | m2>=8.0 | m2>=9.0 | m3>=4.0 | m3>=5.0 | m3>=6.0 | m3>=8.0 |
1101001110
1111101000
1001101100
1001111111
1101101100
[END Relational Context]
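
A minimal Python sketch of ordinal scaling on the sample table, given as an illustration and not as Transformer's code (the function name is hypothetical). Each bit answers the question "is the value greater than or equal to this distinct value of the column?".

```python
# Sample table: row label -> values of m1, m2, m3.
ROWS = {
    "g1": [5, 7, 6],
    "g2": [6, 8, 4],
    "g3": [4, 8, 5],
    "g4": [4, 9, 8],
    "g5": [5, 8, 5],
}

def ordinal_scaling(rows):
    """For each column, emit one bit 'value >= d' per distinct value d (increasing)."""
    columns = list(zip(*rows.values()))
    distinct = [sorted(set(col)) for col in columns]
    return {
        label: "".join(
            "1" if v >= d else "0"
            for v, vals in zip(values, distinct)
            for d in vals
        )
        for label, values in rows.items()
    }

coded = ordinal_scaling(ROWS)
for label, bits in coded.items():
    print(label, bits)  # g1 1101001110, g2 1111101000, ...
```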

Interordinal scaling

This method is defined in [1] and does not need any particular option.
First, it searches for the distinct values of each attribute.
Then, for each distinct value v of an attribute m, it builds the two binary attributes m<=v and m>=v, as shown in the example.
This method has been used with Coron in [7].

Called with

$ ./pre01_transformer.sh sample/numeric/interordinal.test -c auto -m interordinal

it returns

[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
g1 | g2 | g3 | g4 | g5
m1<=4.0 | m1<=5.0 | m1<=6.0 | m1>=4.0 | m1>=5.0 | m1>=6.0 |
 m2<=7.0 | m2<=8.0 | m2<=9.0 | m2>=7.0 | m2>=8.0 | m2>=9.0 |
 m3<=4.0 | m3<=5.0 | m3<=6.0 | m3<=8.0 | m3>=4.0 | m3>=5.0 | m3>=6.0 | m3>=8.0 |
01111011110000111110
00111101111011111000
11110001111001111100
11110000111100011111
01111001111001111100
[END Relational Context]
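
A minimal Python sketch of interordinal scaling on the sample table, given as an illustration and not as Transformer's code (the function name is hypothetical). For each column, all "<=" attributes come first, followed by all ">=" attributes, which matches the attribute order of the output above.

```python
# Sample table: row label -> values of m1, m2, m3.
ROWS = {
    "g1": [5, 7, 6],
    "g2": [6, 8, 4],
    "g3": [4, 8, 5],
    "g4": [4, 9, 8],
    "g5": [5, 8, 5],
}

def interordinal_scaling(rows):
    """Per column: bits 'value <= d' for each distinct d, then bits 'value >= d'."""
    columns = list(zip(*rows.values()))
    distinct = [sorted(set(col)) for col in columns]
    coded = {}
    for label, values in rows.items():
        bits = []
        for v, vals in zip(values, distinct):
            bits.extend("1" if v <= d else "0" for d in vals)
            bits.extend("1" if v >= d else "0" for d in vals)
        coded[label] = "".join(bits)
    return coded

coded = interordinal_scaling(ROWS)
for label, bits in coded.items():
    print(label, bits)  # g1 01111011110000111110, ...
```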

Bibliography
1. Formal Concept Analysis: Mathematical Foundations. B. Ganter, R. Wille, Springer Verlag, 1999.
2. Assessment of discretization techniques for relevant pattern discovery from gene expression data. R. G. Pensa, C. Leschi, J. Besson, J.-F. Boulicaut, BIOKDD, pages 24-30, 2004.
3. Quick Hierarchical Biclustering on Microarray Gene Expression Data. L. Ji, K. W. Mock, K. Tan, Proceedings of the Sixth IEEE Symposium on Bioinformatics and Bioengineering (BIBE), October 16-18, 2006, IEEE Computer Society, Washington, DC, pages 110-120, 2006.
4. Using Formal Concept Analysis for Microarray Data Comparison. V. Choi, Y. Huang, V. Lam, D. Potter, R. C. Laubenbacher, K. Duca, J. Bioinformatics and Computational Biology 6(1), pages 65-75, 2008.
5. Formal Concept Analysis for the Identification of Combinatorial Biomarkers in Breast Cancer. S. Motameny, B. Versmold, R. Schmutzler, ICFCA 2008, pages 229-240, 2008.
6. Using Formal Concept Analysis for the Extraction of Groups of Co-expressed Genes. M. Kaytoue, S. Duplessis, A. Napoli, MCO 2008, pages 439-449, 2008.
7. Two FCA-Based Methods for Mining Gene Expression Data. M. Kaytoue, S. Duplessis, S. O. Kuznetsov, A. Napoli, ICFCA 2009, pages 251-266, 2009.
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License