Coron-base

Overview

Coron-base is the most important module of the Coron platform. This module is responsible for the extraction of different itemsets, providing input to the other modules of the system. With Coron-base one can extract the following itemsets:

  • frequent itemsets (FIs)
  • frequent closed itemsets (FCIs)
  • frequent generators (FGs)
  • maximal frequent itemsets (MFIs)
  • minimal rare itemsets (mRIs)
  • minimal rare generators (mRGs; equivalent to mRIs)
  • rare itemsets (RIs)
  • perfectly rare itemsets (PRIs)
  • minimal zero generators (mZGs)
  • pseudo-closed itemsets (PCIs)

Command-line usage

./core01_coron.sh  [switches]  <database>  <min_supp>  [-alg:<alg>]

There are two compulsory parameters:

  1. the database file (in .basenum, .bool or .rcf format), and
  2. the minimum support (in absolute or relative value).

The minimum support can be given in either absolute or relative value, e.g. 2 or 40%.

The algorithm to be used can be specified with the -alg:<alg> switch. The available algorithms are described below.

Input file formats

Throughout this guide we will work with the dataset shown in Table 1. In the examples we assume that this dataset is stored in a file called laszlo.rcf in .rcf format. The supported file formats are shown in Table 2. Line i of a .basenum file contains items that are included in object i. The .bool file is a binary matrix representation of the binary database. The .rcf file is very similar to the .bool file format but it has the advantage that names can be assigned to objects and to attributes.

Table 1: A sample dataset for the examples.

A (1) B (2) C (3) D (4) E (5)
o1 X X X X
o2 X X
o3 X X X X
o4 X X X
o5 X X X X

Table 2: Our example dataset (Table 1) in different file formats.

(1) .basenum (2) .bool (3) .rcf
1 2 4 5
1 3
1 2 3 5
2 3 5
1 2 3 5
1 1 0 1 1
1 0 1 0 0
1 1 1 0 1
0 1 1 0 1
1 1 1 0 1
[Relational Context]
Default Name
[Binary Relation]
Name_of_dataset
o1 | o2 | o3 | o4 | o5
a | b | c | d | e
1 1 0 1 1
1 0 1 0 0
1 1 1 0 1
0 1 1 0 1
1 1 1 0 1
[END Relational Context]

A short example

./core01_coron.sh  sample/laszlo.rcf  2  -names  -alg:apriori

Meaning: from the file sample/laszlo.rcf extract all FIs using Apriori with min_supp=2. Instead of attribute numbers, show attribute names.

Result:

# Database file name:               sample/laszlo.rcf
# Database file size:               208 bytes
# Number of lines:                  5
# Largest attribute:                5
# Number of attributes:             5
# Number of attributes in average:  3.4
# min_supp:                         2, i.e. 40%
# Chosen algorithm:                 Apriori

{a} (4)
{b} (4)
...
# FIs: 15

Statistics

At the beginning and at the end there are some statistics about the dataset and the number of found itemsets.

If you only want to analyze the input dataset without calculating the itemsets, use the -stat option:

./core01_coron.sh  sample/laszlo.rcf  -stat

In this case the program terminates after showing the database statistics.

The -names option is highly recommended. It works only for .rcf files. With this option, attribute numbers can be replaced with their names. The example above without -names would look like this:

./core01_coron.sh  sample/laszlo.rcf  40%  -alg:apriori

Result:

 {1} (4)
 {2} (4)
…

This means: the first attribute has support 4, the second attribute has also support 4, etc.

Available algorithms with examples

Here we present the available algorithms in Coron-base.

Common notations in the outputs:

  • support values are between parentheses, e.g. (4)
  • closed itemsets are marked with a '+' sign
  • generators are also called as keys

A-Close (-alg:aclose)

A levelwise alg. to find FGs and their closures. Not too efficient since it needs to do lots of intersections for computing the corresponding FCIs.

./core01_coron.sh sample/laszlo.rcf 2 -names -alg:aclose

Sample output:

{b} (4) [key: +; closure: {b, e}]

Meaning: B is an FG with the closure BE. Support value: 4.

Notes: to read more about A-Close, please refer to [4].

Apriori (-alg:apriori)

This is the absolute basic levelwise algorithm for finding FIs. It is efficient for sparse datasets only.

./core01_coron.sh sample/laszlo.rcf 2 -names -alg:apriori

Sample output:

{a} (4)

Meaning: A is an FI with support 4.

Notes: to read more about Apriori, please refer to [5], [6], [7].

Apriori-Close (-alg:aprioriclose, -alg:ac)

An extension of Apriori that finds FIs and marks FCIs. The FCI-identification gives no overhead.

./core01_coron.sh sample/laszlo.rcf 2 -names -alg:aprioriclose

Sample output:

{a} (4) +
{b} (4)

In each algorithm, the '+' sign means that the itemset is closed. That is: A is an FCI with support 4; B is frequent (but not closed) with support 4.

Notes: to read more about Apriori-Close, please refer to [8].

Apriori-Inverse (-alg:aprioriinverse, -alg:ai)

Finds perfectly rare itemsets. An itemset is a PRI if all its subsets are rare (with the exception of the empty set).

./core01_coron.sh sample/laszlo.rcf 2 -names -alg:aprioriinverse

Sample output:

{d} (1)

Notes: to read more about Apriori-Inverse, please refer to [9].

Apriori-Rare (-alg:apriorirare)

Finds minimal rare itemsets. An itemset is an mRI if it is rare and all its subsets are frequent. Here you must specify an extra option:

You want to extract rare itemsets. You must choose:
   -all     : extract zero itemsets too (where support = 0), or
   -nonzero : extract non-zero itemsets only (where support > 0)
You must specify just one option.

Among mRIs there can be lots of itemsets with support 0. We call them zero itemsets (not to be confused with the empty set). With this option you can choose between showing or hiding them.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:apriorirare -all

Sample output:

{d} (1)
{a, b, c} (2)

These itemsets are rare (support is below 3) and all their subsets are frequent.

Notes: to read more about Apriori-Rare, please refer to [1].

Arima (-alg:arima)

Arima calls Apriori-Rare to extract mRIs. Then, from this set it restores the family of non-zero rare itemsets. This process is memory extensive.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:arima

Sample output:

{a, b, d, e} (1)
{a, b, c, e} (2)

It produces all rare itemsets with support below 3 and above 0. The alg. also counts the number of minimal zero generators, though they are not shown in the output.

Notes: to read more about Arima, please refer to [1].

BtB (-alg:btb)

Breaking the Barrier. First, the alg. finds the set of non-zero minimal rare generators. Actually, the set of mRIs and the set of mRGs are equivalent, as we showed it in [1]. An mRG is a rare generator and all its subsets are frequent itemsets (more precisely frequent generators). Then, BtB finds the closures of these mRGs. The output is a set of equivalence classes whose generators are mRGs.

./core01_coron.sh sample/laszlo.rcf 4 -names -alg:btb

Sample output:

{a, b, e} (3) +; [{a, b}, {a, e}]
{a, b, d, e} (1) +; [{d}]

Since min_supp=4, we are interested in eq. classes below this threshold (and above support 0, i.e. zero itemsets are eliminated in the output). ABE is a closed itemset with the following generators: AB and AE. These two generators are mRGs.

Notes: to read more about BtB, please refer to thesis of Laszlo Szathmary [2].

Carpathia (-alg:carpathia)

Similar to A-Close. A-Close first finds all FGs and then it calculates the closures. When Carpathia extracts the i-long FGs, it calculates their closures immediately. Then it finds the (i+1)-long FGs, etc.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:carpathia

Sample output:

{b} (4) [key: +; closure: {b, e}]

B is an FG whose closure is BE.

Notes: to read more about Carpathia, please refer to [1]. In [1] we present MRG-Exp which is very similar to Carpathia. MRG-Exp starts by exploring the family of FGs, just like Carpathia. After this Carpathia finds their closures which is not done by MRG-Exp.

Carpathia-G (-alg:carpathiag)

A levelwise alg. to find FGs only. Like Carpathia without the closure computation step.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:carpathiag

Sample output:

{a} (4) [key: +]
{b} (4) [key: +]

Meaning: A is an FG with support 4.

Notes: to read more about Carpathia-G, please refer to [1]. In [1] we present MRG-Exp which is very similar to Carpathia-G. Difference: MRG-Exp retains the rare itemsets (mRGs, to be more precise).

Carpathia-G-Rare (-alg:carpathiagrare)

Finds minimal rare generators (mRGs). An mRG is a rare generator and all its subsets are frequent itemsets (more precisely frequent generators). Actually, the set of mRIs and the set of mRGs are equivalent (shown in [1]). Carpathia-G-Rare extracts FGs and during the process it filters mRGs. Like in the case of Apriori-Rare, you must use an extra option here:

You want to extract rare itemsets. You must choose:
   -all     : extract zero itemsets too (where support = 0), or
   -nonzero : extract non-zero itemsets only (where support > 0)
You must specify just one option.

That is, you can decide whether to show or hide zero itemsets in the output.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:carpathiagrare -nonzero

Sample output:

{a, b, c} (2) [key: +]
{a, c, e} (2) [key: +]

With the -nonzero option we decided to hide the zero itemsets. Since min_supp=3, we want to extract itemsets below this threshold (and above 0). ABC is an mRG with support 2. By definition, all its subsets are FGs.

Notes: to read more about Carpathia-G-Rare, please refer to [1]. In [1], Carpathia-G-Rare is renamed to MRG-Exp.

Charm (-alg:charm)

A very efficient depth-first alg. to extract FCIs. More efficient on dense datasets.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:charm

Sample output:

{a, b, e} (3) +

Meaning: ABE is an FCI with support 3.

Notes: to read more about Charm, please refer to [10].

Close (-alg:close)

A levelwise alg. to find FCIs. Note that it was the very first alg. that we implemented in the Coron System.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:close

Sample output:

{a, b, e} (3) +

Meaning: ABE is an FCI with support 3.

Notes: to read more about Close, please refer to [11].

CloseByOne (-alg:cbo)

CloseByOne extracts all closed itemsets in a depth-first manner. Note that the algorithm requires minimum support to be 1. The reason is that this is a bottom-up algorithm: it generates itemsets from minimal (bottom) to maximal support (top).

./start.sh sample/laszlo.rcf -alg:cbo -names

Sample output:

{a, b, e} (3) +
{b, e} (4) +

Notes: This algorithm comes from the FCA community. To read more about CloseByOne, please refer to [3].

dCharm (-alg:dcharm)

A diffset implementation of Charm. A very efficient depth-first alg. to extract FCIs. More efficient on dense datasets. It is like Charm but instead of tidsets we use diffsets. As a result, it performs better and uses less memory. Its output is the same as Charm's.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:dcharm

Sample output:

{b, c, e} (3) +

Meaning: BCE is an FCI with support 3.

Notes: to read more about dCharm, please refer to [12].

dEclat (-alg:declat)

A diffset implementation of Eclat. A very efficient depth-first alg. to extract FIs. Its output is the same as Eclat's, but it performs better and requires less memory.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:declat

Sample output:

{a, b} (3)

Meaning: AB is an FI with support 3.

Notes: to read more about dEclat, please refer to [12].

dTalky-G (-alg:dtalkyg)

A diffset implementation of Talky-G. A very efficient depth-first alg. to extract FGs. Its output is the same as Talky-G's, but it performs better and requires less memory.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:dtalkyg

Sample output:

{c, e} (3) [key: +]

Meaning: CE is an FG with support 3.

Notes: to read more about dTalky-G, please refer to [13].

dTouch (-alg:dtouch)

A diffset implementation of Touch. A very efficient depth-first alg. to extract FCI/FG pairs. Its output is the same as Touch's, but it performs better and requires less memory. dTouch consists of 3 steps: (1) extract FCIs with dCharm; (2) extract FGs with dTalky-G; and (3) associate the FGs to their closures. Output: all frequent equivalence classes.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:dtouch

Sample output:

{b, c, e} (3) +; [{b, c}, {c, e}]

Meaning: BCE is a closed itemset with support 3. In its equivalence class there are two FGs namely BC and CE.

Notes: to read more about dTouch, please refer to [13].

Eclat (-alg:eclat)

A very efficient depth-first alg. to extract FIs.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:eclat

Sample output:

{a, c} (3)

Meaning: AC is an FI with support 3.

Notes: to read more about Eclat, please refer to [14], [15].

Eclat-Z (-alg:eclatz)

A hybrid alg. to extract FCI/FG pairs. First it finds all FIs using the depth-first alg. Eclat, then it processes the FIs in a way similar to Zart, i.e. it filters FCIs, identifies FGs, and associates FGs to their closures. Output: all frequent equivalence classes.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:eclatz

Sample output:

{b, c, e} (3) +; [{b, c}, {c, e}]

Meaning: BCE is a closed itemset with support 3. In its equivalence class there are two FGs namely BC and CE.

Notes: to read more about Eclat-Z, please refer to [16].

MFI-Simple (-alg:mfi)

Finds maximal frequent itemsets (MFIs). First it finds all FCIs, then it filters MFIs.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:mfi

Sample output:

{a, c} (3) [mfi: +] +

Meaning: AC is an MFI. By definition, MFIs are also FCIs.

(1)
\begin{align} MFI\subseteq FCI\subseteq FI \end{align}

Notes: to read more about MFI-Simple, please refer to thesis of Laszlo Szathmary [2].

Next-Closure (-alg:nc)

Next-Closure, also called Ganter's algorithm, extracts all closed itemsets. For this algorithm, minimum support must be set to 1.

. start.sh sample/laszlo.rcf -alg:nc -names

Sample output:

{a, b, d, e} (1) +
{a, b, e} (3) +

Notes: This algorithm comes from the FCA community. To read more about Next-Closure, please refer to [3].

Norris (-alg:norris)

Norris extracts all closed itemsets. For this algorithm, minimum support must be set to 1. The reason is that this is an incremental algorithm: it generates itemsets for the i first objects at the step i.

. start.sh sample/laszlo.rcf -alg:norris -names

Sample output:

{a, b, c, e} (2) +
{c} (4) +

Notes: This algorithm comes from the FCA community. To read more about Norris, please refer to [3].

Pascal (-alg:pascal)

A levelwise alg. to find FIs. It also marks FGs. However, it does not find FCIs.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:pascal

Sample output:

{b, c} (3) [key: +]
{b, e} (4) [key: -]

Meaning: BC is an FG with support 3. BE is an FI (but not an FG) with support 4.

Notes: to read more about Pascal, please refer to [17].

Pascal+ (-alg:pascalplus, -alg:pascal+)

An extension of Pascal to identify FCIs too. That it, it finds all FIs and it marks FGs and FCIs.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:pascalplus

Sample output:

{b, c} (3) [key: +]
{b, e} (4) [key: -] +

Meaning: BC is an FG with support 3. BE is an FCI but not an FG (with support 4).

Notes: to read more about Pascal+, please refer to [18].

Pseudo-Closed (-alg:pseudoclosed, -alg:pc)

A levelwise alg. to find frequent pseudo-closed itemsets (FPCIs). These itemsets are required for instance for the Duquenne-Guigues basis.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:pseudoclosed

Sample output:

{b} (4) [pseudo-closed: +; closure: {b, e}]

Meaning: B is an FPCI with support 4. Its closure is BE.

Notes: to read more about Pseudo-Closed, please refer to [8].

Talky (-alg:talky)

A very efficient depth-first alg. to extract FIs. It is like Eclat but with a different traversal strategy. Here we use reverse pre-order traversal from right to left. It has the advantage that whenever an itemset X is reached, all its subsets were already discovered. Its output is like Eclat's, but the FIs are produced in a different order.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:talky

Sample output:

{b} (4)

Meaning: B is an FI with support 4.

Notes: to read more about Talky, please refer to [13].

Talky-G (-alg:talkyg)

A very efficient depth-first alg. to extract FGs.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:talkyg

Sample output:

{b, c} (3) [key: +]

Meaning: BC is an FG with support 3.

Notes: to read more about Talky-G, please refer to [13].

Touch (-alg:touch)

A very efficient depth-first alg. to extract FCI/FG pairs. Touch consists of 3 steps: (1) extract FCIs with Charm; (2) extract FGs with Talky-G; and (3) associate the FGs to their closures. Output: all frequent equivalence classes.

./core01_coron.sh sample/laszlo.rcf 3 -names -alg:touch

Sample output:

{b, c, e} (3) +; [{b, c}, {c, e}]

Meaning: BCE is a closed itemset with support 3. In its equivalence class there are two FGs namely BC and CE.

Notes: to read more about Touch, please refer to [13].

Zart (-alg:zart)

An extension of Pascal to produce FCI/FG pairs. Note that it was our first alg. to produce this kind of output. Zart identifies FCIs and associates the FGs to their closures.

 ./core01_coron.sh sample/laszlo.rcf 3 -names -alg:zart

Sample output:

{b, c, e} (3) +; [{b, c}, {c, e}]

Meaning: BCE is a closed itemset with support 3. In its equivalence class there are two FGs namely BC and CE.

Notes: to read more about Zart, please refer to [18].


Bibliography
1. Towards Rare Itemset Mining. Szathmary, L., Napoli, A. and Valtchev, P. In Proc. of the 19th IEEE Intl. Conf. on Tools with Artificial Intelligence (ICTAI '07), pages 305-312, Patras, Greece, Oct 2007. [ PDF, BibTeX ]
2. Symbolic Data Mining Methods with the Coron Platform. Szathmary, L. PhD thesis, University Henri Poincaré — Nancy 1, France, Nov 2006. [ PDF, BibTeX ]
3. Comparing performance of algorithms for generating concept lattices. Sergei O. Kuznetsov, Sergei A. Obiedkov, J. Exp. Theor. Artif. Intell. 14(2-3): 189-216 (2002)
4. Discovering Frequent Closed Itemsets for Association Rules. Pasquier, N., Bastide, Y., Taouil, R. and Lakhal, L. In Proc. of the 7th Intl. Conf. on Database Theory (ICDT '99), pages 398-416, 1999.
5. Fast Algorithms for Mining Association Rules in Large Databases. Agrawal, R. and Srikant, R. In Proc. of the 20th Intl. Conf. on Very Large Data Bases (VLDB '94), pages 487-499, 1994.
6. Efficient algorithms for discovering association rules. Mannila, H., Toivonen, H. and Verkamo, A. I. In Proc. of the AAAI Workshop on Knowledge Discovery in Databases (KDD '94), pages 181-192, Seattle, Washington, USA, July 1994.
7. Fast discovery of association rules. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A. I.. In Advances in knowledge discovery and data mining, pages 307-328.American Association for Artificial Intelligence, , 1996.
8. Closed Set Based Discovery of Small Covers for Association Rules. Pasquier, N., Bastide, Y., Taouil, R. and Lakhal, L. In Proc. 15emes Journees Bases de Donnees Avancees (BDA), pages 361-381, 1999.
9. Finding Sporadic Rules Using Apriori-Inverse. Koh, Y.S. and Rountree, N. In Proc. of the 9th Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining (PAKDD '05), Hanoi, Vietnam, pages 97-106, May 2005.
10. ChARM: An Efficient Algorithm for Closed Itemset Mining. Zaki, M. J. and Hsiao, C.-J. In SIAM Intl. Conf. on Data Mining (SDM' 02), pages 33-43, 2002.
11. Efficient mining of association rules using closed itemset lattices. Pasquier, N., Bastide, Y., Taouil, R. and Lakhal, L.. Inf. Syst., 24(1):25-46, 1999.
12. Fast vertical mining using diffsets. Zaki, M. J. and Gouda, K. In Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge discovery and data mining (KDD '03), pages 326-335, New York, NY, USA, 2003.
13. Efficient Vertical Mining of Frequent Closures and Generators. Szathmary, L., Valtchev, P., Napoli, A. and Godin, R. In Proc. of the 8th Intl. Symposium on Intelligent Data Analysis (IDA '09), pages 393-404, Lyon, France, 2009. [ PDF, BibTeX ]
14. New Algorithms for Fast Discovery of Association Rules. Zaki, M. J., Parthasarathy, S., Ogihara, M. and Li, W. In Proc. of the 3rd Intl. Conf. on Knowledge Discovery in Databases, pages 283-286, 1997.
15. Scalable Algorithms for Association Mining. Zaki, M. J.. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390, 2000.
16. An Efficient Hybrid Algorithm for Mining Frequent Closures and Generators. Szathmary, L., Valtchev, P., Napoli, A. and Godin, R. In Proc. of the 6th Intl. Conf. on Concept Lattices and Their Applications (CLA '08), pages 47-58, Olomouc, Czech Republic, Oct 2008. [ PDF, BibTeX ]
17. Mining Frequent Patterns with Counting Inference. Bastide, Y., Taouil, R., Pasquier, N., Stumme, G. and Lakhal, L.. SIGKDD Explor. Newsl., 2(2):66-75, 2000.
18. ZART: A Multifunctional Itemset Mining Algorithm. Szathmary, L., Napoli, A. and Kuznetsov, S. O. In Proc. of the 5th Intl. Conf. on Concept Lattices and Their Applications (CLA '07), pages 26-37, 2007. (URL) [ PDF, BibTeX ]
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License