Clustering

From BITS wiki

Clustering is a common computational technique for data analysis in the life sciences. It partitions data into groups whose members share similar characteristics.

Software

TransClust

TransClust (available for Win/Lin/Mac) [1] is high-throughput clustering software based on Weighted Transitive Graph Projection. Its main advantage over other approaches is that its underlying model directly reflects the hidden transitive substructures that are typical of, for example, biomedical data sets. Compared with other clustering methods, its density parameter (the threshold) can be chosen as intuitively as k for k-means.
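To illustrate the role of the threshold, the following minimal sketch keeps only the pairs whose similarity exceeds a threshold and reports the connected components of the resulting graph. This is a simplified stand-in, not TransClust's algorithm: transitivity clustering additionally adds or removes edges at minimal cost until the graph is transitive. The function name and the similarity scores are made up for illustration.

```python
from collections import defaultdict


def threshold_components(similarities, threshold):
    """Return the connected components of the thresholded similarity graph."""
    neighbours = defaultdict(set)
    items = set()
    for (a, b), score in similarities.items():
        items.update((a, b))
        # Keep an edge only if the pair is more similar than the threshold.
        if score > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(items):
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical pairwise similarity scores.
similarities = {
    ("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.2,
    ("C", "D"): 0.1, ("D", "E"): 0.7,
}
low = threshold_components(similarities, 0.5)
high = threshold_components(similarities, 0.85)
print(low)
print(high)
```

With the lower threshold, A-B and B-C are linked but A-C is not. Connected components simply merge the three items, whereas a transitivity-based method weighs the cost of adding the missing A-C edge against the cost of removing an existing one. Raising the threshold yields more, smaller clusters.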

TransClust can be used:

  • Within Cytoscape as a plugin (see tutorial)
  • On the developer web-server (link and Tutorial)
  • From the terminal with a graphical user interface (java -jar -Xmx2G -Xss100M TransClust.jar -gui)
  • From the terminal as a standalone command-line application with the required parameters. Adding -gui to the java command opens the graphical interface instead.

The help page is reproduced here to help you build a valid command:

## USAGE
        java -jar [java virtual machine options] TransClust.jar [-key
        value]

        e.g. java -jar -Xmx2G -Xss100M TransClust.jar -i cost_matrix_dir
        -o clusters.cls

        Note: If the input is large and/or complex then the virtual machine
        options must be set.
        Any values that include spaces must be surrounded by quotation marks
        '"'.
        { } denotes the value choices, [ ] means that the value is a list,
        and ' ' surrounds a description of the value.
        Further note that the keys are not case sensitive, but the class
        names of the respective implementations are!


## COMPULSORY OPTIONS
        One of the following must be entered.

        -key  value

        -i  {'inputdir', 'costmatrixfile.cm'}  
                Input file or directory.

        -o  {'output.file', 'output.conf'}  
                Output file for the clustering results or the generated config
                file.
        OR

        -gui  {[OTHER OPTIONS]}  
                Start the program with the graphical user interface. It is
                also possible to initialise the gui with the OTHER OPTIONS
                defined below!
        OR

        -help  {}  
                Show this help manual.


## OTHER OPTIONS
        These are optional. All parameters that are not specified here are
        first taken from the input config file if stated, otherwise from
        the default config file that comes with this program. IMPORTANT:
        The given input parameter values override any values written in
        the config files.

        -key  value  (default value)

        EXTRA (not defined in the config file)

        -verbose  {}  ()
                Write a short summary of the program results to the standard
                output (console).

        -cf  {true, false}  (false)
                Use config file (true) or hard coded standard options (false).
               
        -config  {'config.conf'}  
                A config file with the program parameters in the correct
                format (see documentation for details).

        -mode  {0,1}  (0)
                Determines the mode in which the program should be started
                0       Default clustering mode: clustering of given input
                        and writing the clusters to the output file.
                1       General training mode: trains a set of data (cost
                        matrices) and writes the generated parameters in
                        the output file.

        -info  {'file.info'}  
                A summary of what functions the program carried out.
                This file includes information such as the date, the
                input and output files, which mode the program was
                carried out in, and which processes were done using
                which implementations.

        -log  {ALL,FINEST,FINER,FINE,CONFIG,INFO,WARNING,SEVERE,OFF}  (OFF)
                Defines the level of logging from the most sensitive level
                to completely off.

        GENERAL

        -l  ['layouterClass']  (FORCEnDLayouter)
                A List of class names of layouter implementations. These
                implementations are then used for the layouting phase in the
                order they are given. Each name should be separated by a ","
                (comma). E.g. FORCEnDLayouter,ACCLayouter  or for just one
                layouter, then only e.g. FORCEnDLayouter.
                Implemented Classes     {FORCEnDLayouter, ACCLayouter}

        -g  {'geometricClustererClass'}  (SingleLinkageClusterer)
                The class name of the geometric clustering implementation.
                Implemented Classes     {SingleLinkageClusterer, KmeansClusterer}

        -p  {'postProcessorClass'}  (PP_DivideAndReclusterRecursively)
                The class name of the post processing implementation. Write
                'none' if post-processing should NOT be carried out.
                Implemented Classes     {PP_RearrangeAndMergeBest, PP_DivideAndRecluster,
                        PP_DivideAndReclusterRecursively}

        -e  {ICCEdgesImplementation}  (CC2DArray)
                The class name of the implementation of the ICCEdges interface
                describing the datastructure for the costs between objects.
                Implemented Classes     {FORCEnDLayouter, ACCLayouter}

        -t  {1,...,max no. CPUs}  (3)
                Turn the use of multiple threads on and give the maximum no.
                of parallel threads (do not give a number greater than the
                number of CPUs your system has).

        GENERAL LAYOUT

        -ld  {2,...,n}  (3)
                The dimension in which the layouters should run in. NOTE:
                Because of runtime reasons, ACCLayouter only makes sense for
                dimensions 2 and 3.

        -lp  {'parameterTrainingClass'}  (ParameterTraining_SE)
                The class name of the parameter training implementation. Write
                'none' if parameter training should NOT be carried out.
                Implemented Classes     {ParameterTraining_SE}

        -lps  {2,...,n}  (15)
                Number of parameter configurations for each generation in the
                parameter training.

        -lpn  {1,...,n}  (3)
                The number of generations that should be used for parameter
                training.

        FORCEnDLayouter

        -fa  {'double'}  (100.0)
                The value for the attraction factor.

        -fr  {'double'}  (100.0)
                The value for the repulsion factor.

        -fi  {'integer'}  (100)
                Number of iterations.

        -ft  {'float'}  (100.0)
                The cooling temperature value for the convergence of the
                layout.

        ACCLayouter

        -aix  {'integer'}  (10000)
                The multiplication factor for the number of iterations.
                (Iterations = number of items * factor)

        -agx  {'integer'}  (25)
                Multiplication factor for the grid size. (Places on the grid
                = number of items * factor)

        -asx  {'integer'}  (15)
                Multiplication factor for the maximum step size. Please choose
                this smaller than the multiplier for the grid size.

        -at  {'antTypeClass'}  (MemoryAnt)
                The class name of the type of ant to be used. ('SimpleAnt',
                'JumpingAnt', 'JumpingAntWithIncreasingViewSize' or 'MemoryAnt')
               
        -akp  {'double'}  (0.15)
                kp value, the higher this value the higher the probability
                to pick up items.

        -akd  {'double'}  (0.2)
                kd value, the higher this value the higher the probability
                to drop items.

        -an  {'integer'}  (1)
                Number of ants.

        -am  {'integer'}  (50)
                Memory size: The number of items that the ant remembers.

        -aa  {'double'}  (1.0)
                The value of the factor alpha for the neighbourhood function.
                (Scales the dissimilarities)

        -as  {'integer'}  (20)
                The maximum step size.

        -av  {'integer'}  (2)
                The maximum view size. Only used with
                JumpingAntWithIncreasingViewSize and MemoryAnt.

        -az  {'double'}  (1.0)
                Normalisation threshold.

        GEOMETRIC CLUSTERING

        SingleLinkageClusterer

        -sm  {'double'}  (0.01)
                The minimum distance.

        -sx  {'double'}  (5.0)
                The maximum distance to look at.

        -ss  {'double'}  (0.01)
                The step size.

        -sf  {'double'}  (0.01)
                The step size factor.

        KmeansClusterer

        -km  {'integer'}  (30)
                The maximum k value that is allowed. This means the maximum
                number of clusters that the input can be divided into.

        -ki  {'integer'}  (1)
                Maximum number of different initial starting point combinations
                (for one k) that k-means uses.

        OverlappingClustering

        -fuzzy  {'double'}  false
                Fuzzy threshold to compute overlapping clustering (give a
                value to activate fuzzy clustering) - disabled by default.

        -fb  {'double'}  lowest observed value
                Fallback value used to create costmatrices
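
The precedence rule stated under OTHER OPTIONS (command-line values override the input config file, which in turn overrides the default config file) can be pictured as a layered lookup. The sketch below only illustrates that rule and is not TransClust's code; the function name and dictionaries are hypothetical, and the keys mirror a few of the options listed above.

```python
from collections import ChainMap


def resolve_parameters(cli_args, input_config, default_config):
    """Merge parameter sources; earlier sources win over later ones."""
    # ChainMap looks each key up in its maps from left to right.
    return dict(ChainMap(cli_args, input_config, default_config))


# Hypothetical contents of the three sources.
default_config = {"t": 3, "ld": 3, "g": "SingleLinkageClusterer"}
input_config = {"t": 2, "g": "KmeansClusterer"}
cli_args = {"t": 4}

params = resolve_parameters(cli_args, input_config, default_config)
print(params["t"], params["g"], params["ld"])
```

Here the thread count comes from the command line, the geometric clusterer from the input config file, and the layout dimension from the default config file.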


No tutorial is provided for command-line usage, but training data is available from the Description page.
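
In the absence of a command-line tutorial, the following sketch shows one way to assemble a command from the options documented in the help page. The helper function, the file names and the chosen option values are illustrative assumptions; only the flags themselves come from the help text. The command is built as a list and printed rather than executed.

```python
import shlex


def build_transclust_command(input_path, output_path, jar="TransClust.jar",
                             jvm_options=("-Xmx2G", "-Xss100M"), **options):
    """Assemble a TransClust command line as a list of arguments.

    Keyword options map to '-key value'; a value of None emits a bare flag.
    """
    command = ["java", "-jar", *jvm_options, jar,
               "-i", input_path, "-o", output_path]
    for key, value in options.items():
        command.append(f"-{key}")
        if value is not None:
            command.append(str(value))
    return command


# Example values only: k-means geometric clustering with at most 30 clusters,
# two threads, and a short summary written to the console.
command = build_transclust_command(
    "cost_matrix_dir", "clusters.cls",
    g="KmeansClusterer", km=30, t=2, verbose=None,
)
print(shlex.join(command))
```

To actually run the clustering, the list could be passed to subprocess.run(command, check=True), assuming Java is installed and TransClust.jar is in the working directory.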



References:
  1. Tobias Wittkop, Dorothea Emig, Sita Lange, Sven Rahmann, Mario Albrecht, John H Morris, Sebastian Böcker, Jens Stoye, Jan Baumbach.
    Partitioning biological data with transitivity clustering.
    Nat Methods 2010, 7(6):419-20.
    [PubMed:20508635]
