SimilarityMatrixTextFiles.pl - Calculate similarity matrices using fingerprints data in TextFile(s)
SimilarityMatrixTextFiles.pl TextFile(s)...
SimilarityMatrixTextFiles.pl [-a, --alpha number] [-b, --beta number] [-c, --ColMode ColNum | ColLabel] [--CompoundIDCol col number | col name] [-d, --detail InfoLevel] [-f, --fast] [--FingerprintsCol col number | col name] [--FingerprintsFormatMode Internal | Specify] [--FingerprintsString Hexadecimal | Binary | RawBinary] [-h, --help] [-m, --mode All | "Tanimoto,[Tversky,...]"] [--InDelim comma | semicolon] [--OutDelim comma | tab | semicolon] [-o, --overwrite] [-p, --precision number] [-q, --quote Yes | No] [-r, --root RootName] [-w, --WorkingDir dirname] TextFile(s)...
Calculate similarity matrices using the fingerprints data column, specified by column number or label, in TextFile(s), and generate CSV/TSV text files containing values for the specified similarity coefficients.
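The overall computation can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the script's implementation: fingerprints are given as plain strings of 0 and 1, the Tanimoto coefficient is used, and the output layout (a header row of compound IDs followed by one quoted row per compound) is hypothetical. The names `similarity_matrix`, `write_matrix_csv`, and the sample IDs are invented for this example.

```python
import csv
import io


def tanimoto(a: str, b: str) -> float:
    """Tanimoto coefficient for two equal-length '0'/'1' strings."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same size")
    na = a.count("1")
    nb = b.count("1")
    nc = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    # Note: raises ZeroDivisionError if neither vector has a bit set.
    return nc / (na + nb - nc)


def similarity_matrix(fingerprints: dict) -> list:
    """Return the full pairwise matrix, in the dictionary's key order."""
    ids = list(fingerprints)
    return [[tanimoto(fingerprints[i], fingerprints[j]) for j in ids] for i in ids]


def write_matrix_csv(ids, matrix, out, precision=2):
    """Write the matrix as quoted CSV (hypothetical layout, not the tool's)."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow(["CompoundID"] + list(ids))
    for cid, row in zip(ids, matrix):
        writer.writerow([cid] + [f"{value:.{precision}f}" for value in row])


# Hypothetical sample data: compound ID -> fingerprint bit string.
fps = {"Cmpd1": "11010010", "Cmpd2": "10011010", "Cmpd3": "11010010"}
matrix = similarity_matrix(fps)
buffer = io.StringIO()
write_matrix_csv(list(fps), matrix, buffer)
print(buffer.getvalue())
```

The matrix is symmetric with 1.00 on the diagonal, so a faster variant could compute only one triangle and mirror it.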
The valid file extensions are .csv and .tsv for comma/semicolon and tab delimited text files respectively. All other file names are ignored. All the text files in the current directory can be specified by *.csv, *.tsv, or the current directory name. The --indelim option determines the format of TextFile(s). Any file that does not match the format indicated by the --indelim option is ignored.
Internal fingerprints string format consists of four parts delimited by semicolon: <Type:StringType:Size:String>. For example:
When --FingerprintsFormatMode is set to Specify, the value of --FingerprintsString determines how the fingerprints string is interpreted.
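As a rough illustration of what interpreting a fingerprints string can involve, the Python sketch below expands a Binary or Hexadecimal string into a string of 0 and 1 characters. It is not the script's parser. The function name `bits_from_string` is hypothetical, and the bit order within each hexadecimal digit (most significant bit first) is an assumption that may differ from the order the tool uses. RawBinary is deliberately not handled.

```python
def bits_from_string(value: str, string_type: str) -> str:
    """Return a '0'/'1' string for a fingerprint given as text.

    string_type mirrors the --FingerprintsString values Binary and
    Hexadecimal; everything else about this function is illustrative.
    """
    if string_type == "Binary":
        if set(value) - {"0", "1"}:
            raise ValueError("binary fingerprint may contain only 0 and 1")
        return value
    if string_type == "Hexadecimal":
        # Assumption: each hex digit expands to 4 bits, most significant first.
        return "".join(f"{int(digit, 16):04b}" for digit in value)
    raise ValueError(f"unsupported string type: {string_type}")


print(bits_from_string("a3", "Hexadecimal"))
print(bits_from_string("0101", "Binary"))
```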
The value All uses the complete list of supported similarity coefficients: BaroniUrbani, Buser, Cosine, Dice, Dennis, Euclid, Forbes, Fossum, Hamann, Jacard, Kulczynski1, Kulczynski2, Manhattan, Matching, McConnaughey, Ochiai, Pearson, RogersTanimoto, RussellRao, Simpson, SkoalSneath1, SkoalSneath2, SkoalSneath3, Tanimoto, Tversky, Yule, WeightedTanimoto, WeightedTversky. These similarity coefficients are described below.
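For two fingerprints bit vectors A and B of the same size, let:
Na = Number of bits set to "1" in A
Nb = Number of bits set to "1" in B
Nc = Number of bits set to "1" in both A and B
Nd = Number of bits set to "0" in both A and B
Nt = Number of bits in A or B (the size of each vector) = Na + Nb - Nc + Nd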
Then, various similarity coefficients [ Ref. 40 - 42 ] for a pair of bit vectors A and B are defined as follows:
BaroniUrbani: ( SQRT( Nc * Nd ) + Nc ) / ( SQRT ( Nc * Nd ) + Nc + ( Na - Nc ) + ( Nb - Nc ) ) ( same as Buser )
Buser: ( SQRT ( Nc * Nd ) + Nc ) / ( SQRT ( Nc * Nd ) + Nc + ( Na - Nc ) + ( Nb - Nc ) ) ( same as BaroniUrbani )
Cosine: Nc / SQRT ( Na * Nb ) (same as Ochiai)
Dice: (2 * Nc) / ( Na + Nb )
Dennis: ( Nc * Nd - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / SQRT ( Nt * Na * Nb)
Euclid: SQRT ( ( Nc + Nd ) / Nt )
Forbes: ( Nt * Nc ) / ( Na * Nb )
Fossum: ( Nt * ( ( Nc - 1/2 ) ** 2 ) ) / ( Na * Nb )
Hamann: ( ( Nc + Nd ) - ( Na - Nc ) - ( Nb - Nc ) ) / Nt
Jaccard: Nc / ( ( Na - Nc) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc ) (same as Tanimoto)
Kulczynski1: Nc / ( ( Na - Nc ) + ( Nb - Nc) ) = Nc / ( Na + Nb - 2Nc )
Kulczynski2: ( ( Nc / 2 ) * ( 2 * Nc + ( Na - Nc ) + ( Nb - Nc) ) ) / ( ( Nc + ( Na - Nc ) ) * ( Nc + ( Nb - Nc ) ) ) = 0.5 * ( Nc / Na + Nc / Nb )
Manhattan: ( ( Na - Nc ) + (Nb - Nc) ) / Nt = ( Na + Nb - 2Nc ) / Nt
Matching: ( Nc + Nd ) / Nt
McConnaughey: ( Nc ** 2 - ( Na - Nc ) * ( Nb - Nc) ) / ( Na * Nb )
Ochiai: Nc / SQRT ( Na * Nb ) (same as Cosine)
Pearson: ( ( Nc * Nd ) - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / SQRT ( Na * Nb * ( Na - Nc + Nd ) * ( Nb - Nc + Nd ) )
RogersTanimoto: ( Nc + Nd ) / ( ( Na - Nc) + ( Nb - Nc) + Nt) = ( Nc + Nd ) / ( Na + Nb - 2Nc + Nt)
RussellRao: Nc / Nt
Simpson: Nc / MIN ( Na, Nb)
SkoalSneath1: Nc / ( Nc + 2 * ( Na - Nc) + 2 * ( Nb - Nc) ) = Nc / ( 2 * Na + 2 * Nb - 3 * Nc )
SkoalSneath2: ( 2 * Nc + 2 * Nd ) / ( Nc + Nd + Nt )
SkoalSneath3: ( Nc + Nd ) / ( ( Na - Nc ) + ( Nb - Nc ) ) = ( Nc + Nd ) / ( Na + Nb - 2 * Nc )
Tanimoto: Nc / ( ( Na - Nc) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc ) (same as Jaccard)
Tversky: Nc / ( alpha * ( Na - Nc ) + ( 1 - alpha) * ( Nb - Nc) + Nc ) = Nc / ( alpha * ( Na - Nb ) + Nb)
Yule: ( ( Nc * Nd ) - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / ( ( Nc * Nd ) + ( ( Na - Nc ) * ( Nb - Nc ) ) )
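A few of the definitions above can be checked with a minimal Python sketch. It assumes fingerprints are plain strings of 0 and 1, and takes Na and Nb as the number of bits set to 1 in A and B, Nc as the number set to 1 in both, Nd as the number set to 0 in both, and Nt as the vector size. The function names are hypothetical and are not part of MayaChemTools; zero denominators are not handled.

```python
from math import sqrt


def bit_counts(a: str, b: str):
    """Return (Na, Nb, Nc, Nd, Nt) for two equal-length '0'/'1' strings."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same size")
    na = a.count("1")
    nb = b.count("1")
    nc = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    nd = sum(1 for x, y in zip(a, b) if x == "0" and y == "0")
    return na, nb, nc, nd, len(a)


def tanimoto(a, b):
    na, nb, nc, _, _ = bit_counts(a, b)
    return nc / (na + nb - nc)


def dice(a, b):
    na, nb, nc, _, _ = bit_counts(a, b)
    return (2 * nc) / (na + nb)


def cosine(a, b):
    na, nb, nc, _, _ = bit_counts(a, b)
    return nc / sqrt(na * nb)


def tversky(a, b, alpha):
    na, nb, nc, _, _ = bit_counts(a, b)
    return nc / (alpha * (na - nc) + (1 - alpha) * (nb - nc) + nc)


A = "11010010"
B = "10011010"
na, nb, nc, nd, nt = bit_counts(A, B)
print(na, nb, nc, nd, nt)      # counts satisfy Nt = Na + Nb - Nc + Nd
print(tanimoto(A, B), dice(A, B), cosine(A, B), tversky(A, B, 0.5))
```

With the Tversky form given above, alpha = 0.5 reduces the denominator to ( Na + Nb ) / 2, so the value coincides with Dice; alpha = 1 gives Nc / Na.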
Values of the Tanimoto/Jaccard and Tversky coefficients depend only on those bits that are set to "1" in both A and B. To take all bit positions into account, modified versions of Tanimoto [ Ref. 42 ] and Tversky [ Ref. 43 ] have been developed.
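Let:
Na' = Number of bits set to "0" in A
Nb' = Number of bits set to "0" in B
Nc' = Number of bits set to "0" in both A and B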
Tanimoto': Nc' / ( ( Na' - Nc') + ( Nb' - Nc' ) + Nc' ) = Nc' / ( Na' + Nb' - Nc' )
Tversky': Nc' / ( alpha * ( Na' - Nc' ) + ( 1 - alpha) * ( Nb' - Nc' ) + Nc' ) = Nc' / ( alpha * ( Na' - Nb' ) + Nb')
Then:
WeightedTanimoto = beta * Tanimoto + (1 - beta) * Tanimoto'
WeightedTversky = beta * Tversky + (1 - beta) * Tversky'
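The weighting can be illustrated with a short Python sketch for the Tanimoto case. It assumes fingerprints are strings of 0 and 1 and that the primed quantities count bits set to 0 in the same way the unprimed ones count bits set to 1; that reading and the function name `weighted_tanimoto` are assumptions made for this example, and zero denominators are not handled.

```python
def weighted_tanimoto(a: str, b: str, beta: float) -> float:
    """beta * Tanimoto + (1 - beta) * Tanimoto' for '0'/'1' strings."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same size")
    pairs = list(zip(a, b))

    # Counts over bits set to "1" (Na, Nb, Nc).
    na, nb = a.count("1"), b.count("1")
    nc = sum(1 for x, y in pairs if x == "1" and y == "1")

    # Counts over bits set to "0" (Na', Nb', Nc'); assumed interpretation.
    na0, nb0 = a.count("0"), b.count("0")
    nc0 = sum(1 for x, y in pairs if x == "0" and y == "0")

    tanimoto = nc / (na + nb - nc)
    tanimoto_prime = nc0 / (na0 + nb0 - nc0)
    return beta * tanimoto + (1 - beta) * tanimoto_prime


A = "11010000"
B = "10010000"
for beta in (1.0, 0.5, 0.0):
    print(beta, weighted_tanimoto(A, B, beta))
```

Setting beta to 1 recovers the ordinary Tanimoto value, and setting it to 0 uses only the bits set to 0.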
To generate a similarity matrix for the Tanimoto similarity coefficient from fingerprints data in any internal fingerprint format, read from a column whose name contains the substring Fingerprint, and create a SampleFPTanimoto.csv file containing compound IDs retrieved from a column whose name contains the substring CompoundID, type:
To generate similarity matrices for all supported similarity coefficients from fingerprints data in any internal fingerprint format, read from a column whose name contains the substring Fingerprint, and create SampleFP[CoefficientName].csv files containing compound IDs retrieved from a column whose name contains the substring CompoundID, type:
To generate similarity matrices for the Buser, Dice and Tanimoto similarity coefficients from fingerprints data in any internal fingerprint format, read from a column whose name contains the substring Fingerprint, and create SampleFP[CoefficientName].csv files containing compound IDs retrieved from a column whose name contains the substring CompoundID, type:
To generate a similarity matrix for the Tanimoto similarity coefficient from fingerprints data in any internal fingerprint format present in column number 2, and create a SampleFPTanimoto.csv file containing compound IDs retrieved from column number 1, type:
To generate a similarity matrix for the Tanimoto similarity coefficient from fingerprints data in any internal fingerprint format in the column named PathLengthFingerprints, and create a SampleFPTanimoto.csv file containing compound IDs retrieved from the column named CompoundID, type:
To generate a similarity matrix for the Tanimoto similarity coefficient from fingerprints data in hexadecimal bit-string format in the column named Fingerprints, and create a SampleFPTanimoto.csv file containing compound IDs retrieved from the column named MolID, type:
To generate a similarity matrix for the Tanimoto similarity coefficient from fingerprints data in any internal fingerprint format, read from a column whose name contains the substring Fingerprint, and create a SampleFPTanimoto.tsv file without any quotes around values, along with compound IDs retrieved from a column whose name contains the substring CompoundID, type:
InfoFingerprintsTextFiles.pl, PathLengthFingerprints.pl, SimilarityMatrixSDFiles.pl
Copyright (C) 2004-2008 Manish Sud. All rights reserved.
This file is part of MayaChemTools.
MayaChemTools is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.