SimilarityMatrixSDFiles.pl - Calculate similarity matrices using fingerprints data in SDFile(s)
SimilarityMatrixSDFiles.pl SDFile(s)...
SimilarityMatrixSDFiles.pl [-a, --alpha number] [-b, --beta number] [--CompoundID DataFieldName or LabelPrefixString] [--CompoundIDMode DataField | MolName | LabelPrefix | MolNameOrLabelPrefix] [-d, --detail InfoLevel] [-f, --fast] [--FingerprintsField FieldLabel] [--FingerprintsFormatMode Internal | Specify] [--FingerprintsString Hexadecimal | Binary | RawBinary] [-h, --help] [-m, --mode All | "Tanimoto,[Tversky,...]"] [--OutDelim comma | tab | semicolon] [-o, --overwrite] [-p, --precision number] [-q, --quote Yes | No] [-r, --root RootName] [-w, --WorkingDir dirname] SDFile(s)...
Calculate similarity matrices using fingerprints data field in SDFile(s) and generate CSV/TSV text files containing values for specified similarity coefficients.
Multiple SDFile names are separated by spaces. The valid file extensions are .sdf and .sd. All other file names are ignored. All the SD files in a current directory can be specified either by *.sdf or the current directory name.
For DataField value of --CompoundIDMode option, this option corresponds to datafield label name whose value is used as compound ID; otherwise, it's a prefix string used for generating compound IDs like LabelPrefixString<Number>. Default value, Cmpd, generates compound IDs which look like Cmpd<Number>.
Examples for DataField value of --CompoundIDMode:
Examples for LabelPrefix or MolNameOrLabelPrefix value of --CompoundIDMode:
The values specified above generate compound IDs which correspond to Compound<Number> instead of the default value of Cmpd<Number>.
Possible values: DataField | MolName | LabelPrefix | MolNameOrLabelPrefix. Default: LabelPrefix.
For MolNameOrLabelPrefix value of --CompoundIDMode, the molname line in SDFile(s) takes precedence over sequential compound IDs generated using LabelPrefix, and only empty molname values are replaced with sequential compound IDs.
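The interplay between --CompoundID and --CompoundIDMode can be sketched in a few lines of Python. This is only an illustration of the rules described above, not the script's actual code: the function name compound_id, its parameters, the 1-based numbering, and the empty-string fallback for a missing data field are all assumptions made for the example.

```python
def compound_id(mode, index, mol_name="", data_fields=None, id_option="Cmpd"):
    """Pick a compound ID for one SD record (hypothetical helper).

    mode        -- one of the documented --CompoundIDMode values
    index       -- record number; assumed here to start at 1
    mol_name    -- contents of the molname line of the record
    data_fields -- dict of data field label -> value for the record
    id_option   -- value of --CompoundID (field label or label prefix)
    """
    data_fields = data_fields or {}
    if mode == "DataField":
        # --CompoundID names the data field holding the ID.
        # Returning "" for a missing field is an assumption.
        return data_fields.get(id_option, "")
    if mode == "MolName":
        return mol_name
    if mode == "LabelPrefix":
        # --CompoundID is a prefix: Cmpd1, Cmpd2, ...
        return f"{id_option}{index}"
    if mode == "MolNameOrLabelPrefix":
        # The molname line wins; only empty names get a sequential ID.
        return mol_name if mol_name.strip() else f"{id_option}{index}"
    raise ValueError(f"unknown CompoundIDMode: {mode}")


if __name__ == "__main__":
    print(compound_id("LabelPrefix", 3))                        # Cmpd3
    print(compound_id("MolNameOrLabelPrefix", 2, "Aspirin"))    # Aspirin
    print(compound_id("MolNameOrLabelPrefix", 2, ""))           # Cmpd2
```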
Internal fingerprints string format consists of four parts delimited by semicolon: <Type:StringType:Size:String>. For example:
For Specify value of --FingerprintsFormatMode option, --FingerprintsString is used to interpret fingerprints string.
All uses the complete list of supported similarity coefficients: BaroniUrbani, Buser, Cosine, Dice, Dennis, Euclid, Forbes, Fossum, Hamann, Jaccard, Kulczynski1, Kulczynski2, Manhattan, Matching, McConnaughey, Ochiai, Pearson, RogersTanimoto, RussellRao, Simpson, SkoalSneath1, SkoalSneath2, SkoalSneath3, Tanimoto, Tversky, Yule, WeightedTanimoto, WeightedTversky. These similarity coefficients are described below.
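For two fingerprint bit vectors A and B of the same size, let:
Na = number of bits set to "1" in A
Nb = number of bits set to "1" in B
Nc = number of bits set to "1" in both A and B
Nd = number of bits set to "0" in both A and B
Nt = number of bits in A or B (size of A or B) = Na + Nb - Nc + Nd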
Then, various similarity coefficients [ Ref. 40 - 42 ] for a pair of bit vectors A and B are defined as follows:
BaroniUrbani: ( SQRT( Nc * Nd ) + Nc ) / ( SQRT ( Nc * Nd ) + Nc + ( Na - Nc ) + ( Nb - Nc ) ) ( same as Buser )
Buser: ( SQRT ( Nc * Nd ) + Nc ) / ( SQRT ( Nc * Nd ) + Nc + ( Na - Nc ) + ( Nb - Nc ) ) ( same as BaroniUrbani )
Cosine: Nc / SQRT ( Na * Nb ) (same as Ochiai)
Dice: (2 * Nc) / ( Na + Nb )
Dennis: ( Nc * Nd - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / SQRT ( Nt * Na * Nb)
Euclid: SQRT ( ( Nc + Nd ) / Nt )
Forbes: ( Nt * Nc ) / ( Na * Nb )
Fossum: ( Nt * ( ( Nc - 1/2 ) ** 2 ) ) / ( Na * Nb )
Hamann: ( ( Nc + Nd ) - ( Na - Nc ) - ( Nb - Nc ) ) / Nt
Jaccard: Nc / ( ( Na - Nc) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc ) (same as Tanimoto)
Kulczynski1: Nc / ( ( Na - Nc ) + ( Nb - Nc) ) = Nc / ( Na + Nb - 2Nc )
Kulczynski2: ( ( Nc / 2 ) * ( 2 * Nc + ( Na - Nc ) + ( Nb - Nc) ) ) / ( ( Nc + ( Na - Nc ) ) * ( Nc + ( Nb - Nc ) ) ) = 0.5 * ( Nc / Na + Nc / Nb )
Manhattan: ( ( Na - Nc ) + (Nb - Nc) ) / Nt = ( Na + Nb - 2Nc ) / Nt
Matching: ( Nc + Nd ) / Nt
McConnaughey: ( Nc ** 2 - ( Na - Nc ) * ( Nb - Nc) ) / ( Na * Nb )
Ochiai: Nc / SQRT ( Na * Nb ) (same as Cosine)
Pearson: ( ( Nc * Nd ) - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / SQRT ( Na * Nb * ( Na - Nc + Nd ) * ( Nb - Nc + Nd ) )
RogersTanimoto: ( Nc + Nd ) / ( ( Na - Nc) + ( Nb - Nc) + Nt) = ( Nc + Nd ) / ( Na + Nb - 2Nc + Nt)
RussellRao: Nc / Nt
Simpson: Nc / MIN ( Na, Nb)
SkoalSneath1: Nc / ( Nc + 2 * ( Na - Nc) + 2 * ( Nb - Nc) ) = Nc / ( 2 * Na + 2 * Nb - 3 * Nc )
SkoalSneath2: ( 2 * Nc + 2 * Nd ) / ( Nc + Nd + Nt )
SkoalSneath3: ( Nc + Nd ) / ( ( Na - Nc ) + ( Nb - Nc ) ) = ( Nc + Nd ) / ( Na + Nb - 2 * Nc )
Tanimoto: Nc / ( ( Na - Nc) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc ) (same as Jaccard)
Tversky: Nc / ( alpha * ( Na - Nc ) + ( 1 - alpha) * ( Nb - Nc) + Nc ) = Nc / ( alpha * ( Na - Nb ) + Nb)
Yule: ( ( Nc * Nd ) - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / ( ( Nc * Nd ) + ( ( Na - Nc ) * ( Nb - Nc ) ) )
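As a rough illustration of how a few of the coefficients listed above can be evaluated, the following Python sketch counts bits in two fingerprints and applies the formulas for Tanimoto, Dice, Cosine, Matching and Tversky. It assumes fingerprints are given as equal-length strings of '0' and '1' characters and that Na, Nb, Nc, Nd and Nt carry their usual meanings (bits set in A, bits set in B, bits set in both, bits unset in both, and vector size). The function names are made up for this example and are not part of MayaChemTools, and all-zero vectors (which lead to division by zero) are not handled.

```python
import math


def bit_counts(a, b):
    """Return (Na, Nb, Nc, Nd, Nt) for two equal-length '0'/'1' strings."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same size")
    na = a.count("1")                                  # bits set in A
    nb = b.count("1")                                  # bits set in B
    nc = sum(x == y == "1" for x, y in zip(a, b))      # set in both
    nd = sum(x == y == "0" for x, y in zip(a, b))      # unset in both
    return na, nb, nc, nd, len(a)


def tanimoto(a, b):
    na, nb, nc, _, _ = bit_counts(a, b)
    return nc / (na + nb - nc)


def dice(a, b):
    na, nb, nc, _, _ = bit_counts(a, b)
    return (2 * nc) / (na + nb)


def cosine(a, b):
    na, nb, nc, _, _ = bit_counts(a, b)
    return nc / math.sqrt(na * nb)


def matching(a, b):
    _, _, nc, nd, nt = bit_counts(a, b)
    return (nc + nd) / nt


def tversky(a, b, alpha):
    na, nb, nc, _, _ = bit_counts(a, b)
    return nc / (alpha * (na - nc) + (1 - alpha) * (nb - nc) + nc)


if __name__ == "__main__":
    a, b = "11010010", "10011010"
    print(bit_counts(a, b))   # (4, 4, 3, 3, 8)
    print(tanimoto(a, b))     # 0.6
    print(dice(a, b))         # 0.75
```

With the Tversky formula as written above, alpha = 0.5 gives the same value as Dice, which makes a convenient sanity check for an implementation.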
Values of Tanimoto/Jaccard and Tversky coefficients depend only on those bits which are set to "1" in both A and B. In order to take into account all bit positions, modified versions of Tanimoto [ Ref. 42 ] and Tversky [ Ref. 43 ] have been developed.
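Let:
Na' = number of bits set to "0" in A
Nb' = number of bits set to "0" in B
Nc' = number of bits set to "0" in both A and B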
Tanimoto': Nc' / ( ( Na' - Nc') + ( Nb' - Nc' ) + Nc' ) = Nc' / ( Na' + Nb' - Nc' )
Tversky': Nc' / ( alpha * ( Na' - Nc' ) + ( 1 - alpha) * ( Nb' - Nc' ) + Nc' ) = Nc' / ( alpha * ( Na' - Nb' ) + Nb')
Then:
WeightedTanimoto = beta * Tanimoto + (1 - beta) * Tanimoto'
WeightedTversky = beta * Tversky + (1 - beta) * Tversky'
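The weighted combination can be sketched in Python as follows. The sketch assumes that the primed quantities Na', Nb' and Nc' are the counterparts of Na, Nb and Nc counted over bits set to "0" rather than "1", and that fingerprints are equal-length strings of '0' and '1' characters. The function names are hypothetical, and vectors for which a denominator is zero are not handled.

```python
def _tanimoto_from_counts(na, nb, nc):
    # Nc / ( Na + Nb - Nc ), usable for both the plain and primed counts.
    return nc / (na + nb - nc)


def weighted_tanimoto(a, b, beta):
    """beta * Tanimoto + (1 - beta) * Tanimoto' for two bit strings."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same size")
    # Counts over bits set to "1".
    na, nb = a.count("1"), b.count("1")
    nc = sum(x == y == "1" for x, y in zip(a, b))
    # Assumption: primed counts are taken over bits set to "0".
    na0, nb0 = a.count("0"), b.count("0")
    nc0 = sum(x == y == "0" for x, y in zip(a, b))
    t = _tanimoto_from_counts(na, nb, nc)
    t_prime = _tanimoto_from_counts(na0, nb0, nc0)
    return beta * t + (1 - beta) * t_prime


if __name__ == "__main__":
    # A = 1100, B = 1000: Tanimoto = 1/2, Tanimoto' = 2/3
    print(weighted_tanimoto("1100", "1000", 1.0))   # 0.5
```

Setting beta to 1 recovers the plain Tanimoto coefficient, while beta = 0 uses only the "0" bits; intermediate values blend the two.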
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints data in any internal fingerprint format present in a data field with Fingerprint substring in its label and create a SampleFPTanimoto.csv file containing sequentially generated compound IDs with Cmpd prefix, type:
To generate similarity matrices corresponding to all supported similarity coefficients for fingerprints data in any internal fingerprint format present in a data field with Fingerprint substring in its label and create a SampleFPTanimoto.csv file containing sequentially generated compound IDs with Cmpd prefix, type:
To generate similarity matrices corresponding to Buser, Dice and Tanimoto similarity coefficients for fingerprints data in any internal fingerprint format present in a data field with Fingerprint substring in its label and create a SampleFPTanimoto.csv file containing sequentially generated compound IDs with Cmpd prefix, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints data in any internal fingerprint format present in a data field named PathLengthFingerprints and create a SampleFPTanimoto.csv file containing compound IDs taken from the data field named Cmpd_ID, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints data in any binary bit-string format present in a data field with Fingerprint substring in its label and create a SampleFPTanimoto.csv file containing sequentially generated compound IDs with Cmpd prefix, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints data in any internal fingerprint format present in a data field with Fingerprint substring in its label and create a SampleFPTanimoto.csv file containing compound IDs from molname line or sequentially generated compound IDs with Mol prefix, type:
To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints data in any internal fingerprint format present in a data field with Fingerprint substring in its label and create a SampleFPTanimoto.tsv file without any quotes around values along with sequentially generated compound IDs with Cmpd prefix, type:
InfoFingerprintsSDFiles.pl, PathLengthFingerprints.pl, SimilarityMatrixTextFiles.pl
Copyright (C) 2004-2008 Manish Sud. All rights reserved.
This file is part of MayaChemTools.
MayaChemTools is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.