MayaChemTools

NAME

SimilarityMatrixSDFiles.pl - Calculate similarity matrices using fingerprints strings data in SDFile(s)

SYNOPSIS

SimilarityMatrixSDFiles.pl SDFile(s)...

SimilarityMatrixSDFiles.pl [--alpha number] [--beta number] [-b, --BitVectorComparisonMode All | "TanimotoSimilarity,[ TverskySimilarity, ... ]"] [--CompoundID DataFieldName or LabelPrefixString] [--CompoundIDMode DataField | MolName | LabelPrefix | MolNameOrLabelPrefix] [-d, --detail InfoLevel] [-f, --fast] [--FingerprintsField FieldLabel] [-h, --help] [-m, --mode AutoDetect | FingerprintsBitVectorString | FingerprintsVectorString] [--OutDelim comma | tab | semicolon] [--OutMatrixFormat RowsAndColumns | IDPairsAndValue] [-o, --overwrite] [-p, --precision number] [-q, --quote Yes | No] [-r, --root RootName] [-v, --VectorComparisonMode All | "TanimotoSimilarity,[ ManhattanDistance, ...]"] [--VectorComparisonFormulism All | AlgebraicForm | BinaryForm | SetTheoreticForm] [-w, --WorkingDir DirName] SDFile(s)...

DESCRIPTION

Calculate similarity matrices using fingerprint bit-vector or vector strings data field in SDFile(s) and generate CSV/TSV text files containing values for specified similarity and distance coefficients.

Multiple SDFile names are separated by spaces. The valid file extensions are .sdf and .sd. All other file names are ignored. All the SD files in the current directory can be specified either by *.sdf or the current directory name.

OPTIONS

--alpha number
Value of alpha parameter for calculating Tversky similarity coefficient specified for -b, --BitVectorComparisonMode option. It corresponds to weights assigned for bits set to "1" in a pair of fingerprint bit-vectors during the calculation of similarity coefficient. Possible values: 0 to 1. Default value: <0.5>.

--beta number
Value of beta parameter for calculating WeightedTanimoto and WeightedTversky similarity coefficients specified for -b, --BitVectorComparisonMode option. It is used to weight the contributions of bits set to "0" during the calculation of similarity coefficients. Possible values: 0 to 1. Default value of <1> makes WeightedTanimoto and WeightedTversky equivalent to Tanimoto and Tversky.

-b, --BitVectorComparisonMode All | "TanimotoSimilarity,[TverskySimilarity,...]"
Specify what similarity coefficients to use for calculating similarity matrices for fingerprints bit-vector strings data values in SDFile(s): calculate similarity matrices for all supported similarity coefficients or specify a comma delimited list of similarity coefficients. Possible values: All | "TanimotoSimilarity,[TverskySimilarity,...]". Default: TanimotoSimilarity.

All uses complete list of supported similarity coefficients: BaroniUrbaniSimilarity, BuserSimilarity, CosineSimilarity, DiceSimilarity, DennisSimilarity, ForbesSimilarity, FossumSimilarity, HamannSimilarity, JacardSimilarity, Kulczynski1Similarity, Kulczynski2Similarity, MatchingSimilarity, McConnaugheySimilarity, OchiaiSimilarity, PearsonSimilarity, RogersTanimotoSimilarity, RussellRaoSimilarity, SimpsonSimilarity, SkoalSneath1Similarity, SkoalSneath2Similarity, SkoalSneath3Similarity, TanimotoSimilarity, TverskySimilarity, YuleSimilarity, WeightedTanimotoSimilarity, WeightedTverskySimilarity. These similarity coefficients are described below.

For two fingerprint bit-vectors A and B of same size, let:

Na = Number of bits set to "1" in A
Nb = Number of bits set to "1" in B
Nc = Number of bits set to "1" in both A and B
Nd = Number of bits set to "0" in both A and B
Nt = Number of bits set to "1" or "0" in A or B (Size of A or B)
Nt = Na + Nb - Nc + Nd
Na - Nc = Number of bits set to "1" in A but not in B
Nb - Nc = Number of bits set to "1" in B but not in A
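The counts defined above can be illustrated with a minimal Python sketch. It assumes the two bit-vectors are already available as equal-length strings of "0" and "1" characters; the function name bit_counts and the sample strings are hypothetical and are not part of MayaChemTools.

```python
def bit_counts(a: str, b: str) -> dict:
    """Return Na, Nb, Nc, Nd and Nt for two equal-length binary strings."""
    if len(a) != len(b):
        raise ValueError("bit-vectors must have the same size")
    pairs = list(zip(a, b))
    na = a.count("1")                                  # bits set to "1" in A
    nb = b.count("1")                                  # bits set to "1" in B
    nc = sum(1 for x, y in pairs if x == "1" and y == "1")  # "1" in both
    nd = sum(1 for x, y in pairs if x == "0" and y == "0")  # "0" in both
    return {"Na": na, "Nb": nb, "Nc": nc, "Nd": nd, "Nt": len(a)}


counts = bit_counts("11010010", "10011000")
print(counts)
```

For these sample strings the identity Nt = Na + Nb - Nc + Nd holds as 8 = 4 + 3 - 2 + 3.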

Then, various similarity coefficients [ Ref. 40 - 42 ] for a pair of bit-vectors A and B are defined as follows:

BaroniUrbaniSimilarity: ( SQRT( Nc * Nd ) + Nc ) / ( SQRT ( Nc * Nd ) + Nc + ( Na - Nc ) + ( Nb - Nc ) ) ( same as Buser )

BuserSimilarity: ( SQRT ( Nc * Nd ) + Nc ) / ( SQRT ( Nc * Nd ) + Nc + ( Na - Nc ) + ( Nb - Nc ) ) ( same as BaroniUrbani )

CosineSimilarity: Nc / SQRT ( Na * Nb ) (same as Ochiai)

DiceSimilarity: (2 * Nc) / ( Na + Nb )

DennisSimilarity: ( Nc * Nd - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / SQRT ( Nt * Na * Nb)

ForbesSimilarity: ( Nt * Nc ) / ( Na * Nb )

FossumSimilarity: ( Nt * ( ( Nc - 1/2 ) ** 2 ) ) / ( Na * Nb )

HamannSimilarity: ( ( Nc + Nd ) - ( Na - Nc ) - ( Nb - Nc ) ) / Nt

JaccardSimilarity: Nc / ( ( Na - Nc) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc ) (same as Tanimoto)

Kulczynski1Similarity: Nc / ( ( Na - Nc ) + ( Nb - Nc) ) = Nc / ( Na + Nb - 2 * Nc )

Kulczynski2Similarity: ( ( Nc / 2 ) * ( 2 * Nc + ( Na - Nc ) + ( Nb - Nc) ) ) / ( ( Nc + ( Na - Nc ) ) * ( Nc + ( Nb - Nc ) ) ) = 0.5 * ( Nc / Na + Nc / Nb )

MatchingSimilarity: ( Nc + Nd ) / Nt

McConnaugheySimilarity: ( Nc ** 2 - ( Na - Nc ) * ( Nb - Nc) ) / ( Na * Nb )

OchiaiSimilarity: Nc / SQRT ( Na * Nb ) (same as Cosine)

PearsonSimilarity: ( ( Nc * Nd ) - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / SQRT ( Na * Nb * ( Na - Nc + Nd ) * ( Nb - Nc + Nd ) )

RogersTanimotoSimilarity: ( Nc + Nd ) / ( ( Na - Nc) + ( Nb - Nc) + Nt) = ( Nc + Nd ) / ( Na + Nb - 2 * Nc + Nt)

RussellRaoSimilarity: Nc / Nt

SimpsonSimilarity: Nc / MIN ( Na, Nb)

SkoalSneath1Similarity: Nc / ( Nc + 2 * ( Na - Nc) + 2 * ( Nb - Nc) ) = Nc / ( 2 * Na + 2 * Nb - 3 * Nc )

SkoalSneath2Similarity: ( 2 * Nc + 2 * Nd ) / ( Nc + Nd + Nt )

SkoalSneath3Similarity: ( Nc + Nd ) / ( ( Na - Nc ) + ( Nb - Nc ) ) = ( Nc + Nd ) / ( Na + Nb - 2 * Nc )

TanimotoSimilarity: Nc / ( ( Na - Nc) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc ) (same as Jaccard)

TverskySimilarity: Nc / ( alpha * ( Na - Nc ) + ( 1 - alpha) * ( Nb - Nc) + Nc ) = Nc / ( alpha * ( Na - Nb ) + Nb)

YuleSimilarity: ( ( Nc * Nd ) - ( ( Na - Nc ) * ( Nb - Nc ) ) ) / ( ( Nc * Nd ) + ( ( Na - Nc ) * ( Nb - Nc ) ) )
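A few of the coefficients listed above can be evaluated directly from the counts. The following Python sketch is only an illustration of the formulas, not the MayaChemTools implementation; the function names are hypothetical and no guard against zero denominators is included.

```python
from math import sqrt


def tanimoto(na, nb, nc):
    return nc / (na + nb - nc)


def dice(na, nb, nc):
    return (2 * nc) / (na + nb)


def cosine(na, nb, nc):
    return nc / sqrt(na * nb)


def tversky(na, nb, nc, alpha=0.5):
    return nc / (alpha * (na - nc) + (1 - alpha) * (nb - nc) + nc)


# Sample counts: Na = 4, Nb = 3, Nc = 2
na, nb, nc = 4, 3, 2
print(tanimoto(na, nb, nc))           # 2 / 5
print(dice(na, nb, nc))               # 4 / 7
print(cosine(na, nb, nc))             # 2 / SQRT(12)
print(tversky(na, nb, nc, alpha=0.5))
```

With alpha = 0.5 the Tversky denominator reduces to ( Na + Nb ) / 2, so the value coincides with DiceSimilarity for the same counts.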

Values of Tanimoto/Jaccard and Tversky coefficients depend only on those bits which are set to "1" in A or B; bits set to "0" in both vectors do not contribute. In order to take into account all bit positions, modified versions of Tanimoto [ Ref. 42 ] and Tversky [ Ref. 43 ] have been developed.

Let:

Na' = Number of bits set to "0" in A
Nb' = Number of bits set to "0" in B
Nc' = Number of bits set to "0" in both A and B

Tanimoto': Nc' / ( ( Na' - Nc') + ( Nb' - Nc' ) + Nc' ) = Nc' / ( Na' + Nb' - Nc' )

Tversky': Nc' / ( alpha * ( Na' - Nc' ) + ( 1 - alpha) * ( Nb' - Nc' ) + Nc' ) = Nc' / ( alpha * ( Na' - Nb' ) + Nb')

Then:

WeightedTanimotoSimilarity = beta * Tanimoto + (1 - beta) * Tanimoto'

WeightedTverskySimilarity = beta * Tversky + (1 - beta) * Tversky'
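The weighting can be sketched in Python as follows. This is a minimal illustration of the WeightedTanimotoSimilarity formula only, assuming bit-vectors given as strings of "0" and "1"; the function names and sample strings are hypothetical.

```python
def tanimoto(na, nb, nc):
    return nc / (na + nb - nc)


def weighted_tanimoto(a: str, b: str, beta: float = 1.0) -> float:
    """beta * Tanimoto + (1 - beta) * Tanimoto' for two binary strings."""
    pairs = list(zip(a, b))
    # Counts over bits set to "1"
    na, nb = a.count("1"), b.count("1")
    nc = sum(1 for x, y in pairs if x == "1" and y == "1")
    # Counts over bits set to "0" (Na', Nb', Nc')
    na0, nb0 = a.count("0"), b.count("0")
    nc0 = sum(1 for x, y in pairs if x == "0" and y == "0")
    return beta * tanimoto(na, nb, nc) + (1 - beta) * tanimoto(na0, nb0, nc0)


a, b = "11010010", "10011000"
print(weighted_tanimoto(a, b, beta=1.0))  # plain Tanimoto
print(weighted_tanimoto(a, b, beta=0.5))
print(weighted_tanimoto(a, b, beta=0.0))  # Tanimoto' only
```

For these strings Tanimoto is 2 / 5 and Tanimoto' is 3 / 6, so beta = 0.5 gives their average.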

--CompoundID DataFieldName or LabelPrefixString
This value is --CompoundIDMode specific and indicates how compound ID is generated.

For DataField value of --CompoundIDMode option, this option corresponds to datafield label name whose value is used as compound ID; otherwise, it's a prefix string used for generating compound IDs like LabelPrefixString<Number>. Default value, Cmpd, generates compound IDs which look like Cmpd<Number>.

Examples for DataField value of --CompoundIDMode:

MolID
ExtReg

Examples for LabelPrefix or MolNameOrLabelPrefix value of --CompoundIDMode:

Compound

The value specified above generates compound IDs which correspond to Compound<Number> instead of the default value of Cmpd<Number>.

--CompoundIDMode DataField | MolName | LabelPrefix | MolNameOrLabelPrefix
Specify how to generate compound IDs for similarity matrix CSV/TSV text file(s): use a SDFile(s) datafield value; use molname line from SDFile(s); generate a sequential ID with specific prefix; use combination of both MolName and LabelPrefix with usage of LabelPrefix values for empty molname lines.

Possible values: DataField | MolName | LabelPrefix | MolNameOrLabelPrefix. Default: LabelPrefix.

For MolNameOrLabelPrefix value of --CompoundIDMode, molname line in SDFile(s) takes precedence over sequential compound IDs generated using LabelPrefix and only empty molname values are replaced with sequential compound IDs.
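The interplay between --CompoundIDMode and --CompoundID can be summarized with a small Python sketch. It models only the behavior described above; the function make_compound_id and its arguments are hypothetical, and returning an empty string for a missing data field or an empty molname line in MolName mode is an assumption, since the text does not say what the script does in those cases.

```python
def make_compound_id(mode, number, mol_name="", data_fields=None,
                     compound_id_option=None):
    """Return a compound ID for the record with sequence number `number`."""
    if mode == "DataField":
        # --CompoundID names the data field whose value is the ID.
        return (data_fields or {}).get(compound_id_option, "")
    if mode == "MolName":
        return mol_name
    # Otherwise --CompoundID is a label prefix; the default prefix is Cmpd.
    label = f"{compound_id_option or 'Cmpd'}{number}"
    if mode == "LabelPrefix":
        return label
    if mode == "MolNameOrLabelPrefix":
        # The molname line wins; only empty names fall back to the label.
        return mol_name.strip() or label
    raise ValueError(f"unsupported CompoundIDMode: {mode}")


print(make_compound_id("LabelPrefix", 3))
print(make_compound_id("MolNameOrLabelPrefix", 2, mol_name="",
                       compound_id_option="Mol"))
```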

-d, --detail InfoLevel
Level of information to print about lines being ignored. Default: 1. Possible values: 1, 2 or 3

-f, --fast
In this mode, fingerprints field specified using --FingerprintsField is assumed to contain valid fingerprints data and no checking is performed before calculating similarity matrices. By default, fingerprints data is validated before computing pairwise similarity coefficients.

--FingerprintsField FieldLabel
Fingerprints field label to use during calculation of similarity matrices for SDFile(s). Default value: first data field label containing the word Fingerprints in its label.

-h, --help
Print this help message.

-m, --mode AutoDetect | FingerprintsBitVectorString | FingerprintsVectorString
Format of fingerprint strings data in SDFile(s): automatically detect format of fingerprints string created by MayaChemTools fingerprints generation scripts or explicitly specify its format. Possible values: AutoDetect | FingerprintsBitVectorString | FingerprintsVectorString. Default value: AutoDetect.

The current release of MayaChemTools supports the following types of fingerprint bit-vector and vector strings:

FingerprintsBitVector;PathLengthBits:AtomicInvariantsAtomTypes;1024;
HexadecimalString;Ascending;00000000000000000000000040000000000000
000000000000000000000000000020200000000000000000004000000000000...
FingerprintsBitVector;PathLengthBits:AtomicInvariantsAtomTypes;1024;
BinaryString;Ascending;0000000000000000000000000000000000000000000
000000000000001000000001000000000010000000000000000000010000000...
FingerprintsVector;PathLengthCount:AtomicInvariantsAtomTypes;27;
NumericalValues;IDsAndValuesPairsString;C 8 O 1 C:C 8 C:O 2 C:C:C 9
C:C:O 3 C:O:C 1 C:C:C:C 10 C:C:C:O 4 C:C:O:C 3 C:C:C:C:C 10 ...
FingerprintsBitVector;MACCSKeyBits;166;BinaryString;Ascending;000000000
000000000000000000000000000000000000000000000000001000000000000
000000000010000000000001001000000000000000000001000000000000000...
FingerprintsBitVector;MACCSKeyBits;166;HexadecimalString;Ascending;0000
002000002010008040084010080100902805e1
FingerprintsBitVector;MACCSKeyBits;322;BinaryString;Ascending;1100000000
0000001000001000010011000001100000001000000000000000101000000000
0000000000000000000000000000000000000000000000000000100000000000...
FingerprintsBitVector;MACCSKeyBits;322;HexadecimalString;Ascending;3001
48c060400041000000000000000100000000000000000000000000000000500
000000000000000
FingerprintsVector;MACCSKeyCount;166;OrderedNumericalValues;ValuesString;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
FingerprintsVector;MACCSKeyCount;322;OrderedNumericalValues;ValuesString;
2 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 1 0 0 7 1 0 0 0 0 0 2 1 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 2 0 0 0 0 0 0 0 0 ...
FingerprintsVector;ExtendedConnectivity:AtomicInvariantsAtomTypes;14;
AlphaNumericalValues;ValuesString;333564680 1142173602 14814699391
977749791 2006158649 291020918 443330853 692611812 816539344173
1657806 2039728782 931045615 1273931663 1317501190
FingerprintsVector;ExtendedConnectivity:FunctionalClassAtomTypes;11;
AlphaNumericalValues;ValuesString;862102353 981185303 12517955598
10600886 885767127 1452087973 1878436093 2029559552 1465773182
1530666307 2113761516
FingerprintsVector;TopologicalAtomPairs:AtomicInvariantsAtomTypes;23;
NumericalValues;IDsAndValuesString;C.X1.BO1.H3-D1-C.X3.BO4 C.X2.BO3.H1-
D1-C.X2.BO3.H1 C.X2.BO3.H1-D1-C.X3.BO4 C.X2.BO3.H1-D1-N.X2.BO2.H1
C.X3.BO4-D1-C.X3.BO4 C.X3.BO4-D1-O.X1.BO2 C.X1.BO1.H3-D2-C.X2.BO3.H1
C.X1.BO1.H3-D2-C.X3.BO4...; 1 1 2 2 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 2 1 1 1
FingerprintsVector;TopologicalAtomPairs:AtomicInvariantsAtomTypes;23;
NumericalValues;IDsAndValuesPairsString;C.X1.BO1.H3-D1-C.X3.BO4 1
C.X2.BO3.H1-D1-C.X2.BO3.H1 1 C.X2.BO3.H1-D1-C.X3.BO4 2 C.X2.BO3.H1-
D1-N.X2.BO2.H1 2 C.X3.BO4-D1-C.X3.BO4 1 C.X3.BO4-D1-O.X1.BO2 1
C.X1.BO1.H3-D2-C.X2.BO3.H1 1 C.X1.BO1.H3-D2-C.X3.BO4 1
C.X2.BO3.H1-D2-C.X2.BO3.H1 1 C.X2.BO3.H1-D2-C.X3.BO4 3...
FingerprintsVector;TopologicalAtomTorsions:AtomicInvariantsAtomTypes;11;
NumericalValues;IDsAndValuesString;C.X1.BO1.H3-C.X3.BO4-C.X2.BO3.H1-
N.X2.BO2.H1 C.X1.BO1.H3-C.X3.BO4-C.X3.BO4-C.X2.BO3.H1 C.X1.BO1.H3-
C.X3.BO4-C.X3.BO4-O.X1.BO2 C.X2.BO3.H1-C.X2.BO3.H1-C.X3.BO4-C.X3.BO4
C.X2.BO3.H1-C.X2.BO3.H1-C.X3.BO4-O.X1.BO2...;
1 1 1 1 1 1 1 1 1 1 1
FingerprintsVector;TopologicalAtomTorsions:AtomicInvariantsAtomTypes;11;
NumericalValues;IDsAndValuesPairsString;C.X1.BO1.H3-C.X3.BO4-C.X2.BO3.H1-
N.X2.BO2.H1 1 C.X1.BO1.H3-C.X3.BO4-C.X3.BO4-C.X2.BO3.H1 1 C.X1.BO1.H3-
C.X3.BO4-C.X3.BO4-O.X1.BO2 1 C.X2.BO3.H1-C.X2.BO3.H1-C.X3.BO4-
C.X3.BO4 1 C.X2.BO3.H1-C.X2.BO3.H1-C.X3.BO4-O.X1.BO2 1 C.X2.BO3.H1-
C.X2.BO3.H1-N.X2.BO2.H1-C.X2.BO3.H1 1 C.X2.BO3.H1-C.X3.BO4-C.X3.BO4-
C.X2.BO3.H1 1...
FingerprintsVector;TopologicalPharmacophoreAtomPairs;150;
OrderedNumericalValues;ValuesString;1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 2 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0
0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
FingerprintsVector;TopologicalPharmacophoreAtomPairs;150;
OrderedNumericalValues;IDsAndValuesString;H-D1-H H-D1-HBA H-D1-HBD
H-D1-NI H-D1-PI HBA-D1-HBA HBA-D1-HBD HBA-D1-NI HBA-D1-PI HBD-D1-HBD
HBD-D1-NI HBD-D1-PI NI-D1-NI NI-D1-PI PI-D1-PI H-D2-H H-D2-HBA ...;
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 2 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 ...
FingerprintsVector;TopologicalPharmacophoreAtomTriplets;4960;
OrderedNumericalValues;ValuesString;0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
FingerprintsVector;TopologicalPharmacophoreAtomTriplets;4960;
OrderedNumericalValues;IDsAndValuesString;Ar1-Ar1-Ar1 Ar1-Ar1-H1
Ar1-Ar1-HBA1 Ar1-Ar1-HBD1 Ar1-Ar1-NI1 Ar1-Ar1-PI1 Ar1-H1-H1 ...;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0...
FingerprintsVector;AtomNeighborhoods:AtomicInvariantsAtomTypes;10;
AlphaNumericalValues;ValuesString;NR0-C.X2.BO3.H1-ATC1:NR1-C.X2.BO3.H1-
ATC1:NR1-C.X3.BO4-ATC1:NR2-C.X2.BO3.H1-ATC1:NR2-C.X3.BO4-ATC1:NR2-
N.X1.BO1.H2-ATC1 NR0-C.X2.BO3.H1-ATC1:NR1-C.X2.BO3.H1-ATC1:NR1-
C.X3.BO4-ATC1:NR2-C.X2.BO3.H1-ATC1:NR2-C.X3.BO4-ATC2 NR0-C.X2.BO3.H1-
ATC1:NR1-C.X2.BO3.H1-ATC2:NR2-C.X2.BO3.H1-ATC1:NR2-C.X3.BO4-ATC1...
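Each of the strings listed above starts with five semicolon-separated header fields followed by the data. A minimal Python sketch of splitting such a string is shown below. The dictionary keys are descriptive labels chosen for this illustration, not names used by MayaChemTools, and the second sample string is made up to show data that itself contains a semicolon. The sketch does not decode the hexadecimal or binary data into bits.

```python
def split_fingerprints_string(fp: str) -> dict:
    """Split a fingerprints string into its header fields and data part."""
    # At most five splits: the data part may contain further semicolons,
    # as in IDsAndValuesString where IDs and values are separated by ";".
    kind, description, size, value_type, layout, data = fp.split(";", 5)
    return {
        "kind": kind,                # FingerprintsBitVector or FingerprintsVector
        "description": description,
        "size": int(size),
        "value_type": value_type,
        "layout": layout,
        "data": data,
    }


maccs = split_fingerprints_string(
    "FingerprintsBitVector;MACCSKeyBits;166;HexadecimalString;Ascending;"
    "0000002000002010008040084010080100902805e1"
)
print(maccs["kind"], maccs["size"], maccs["value_type"])

# Hypothetical string, for illustration only
pairs = split_fingerprints_string(
    "FingerprintsVector;ExampleAtomPairs;2;NumericalValues;"
    "IDsAndValuesString;C-D1-C C-D1-O;3 1"
)
ids, values = pairs["data"].split(";")
print(ids.split(), values.split())
```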

--OutDelim comma | tab | semicolon
Delimiter for output CSV/TSV text file(s). Possible values: comma, tab, or semicolon. Default value: comma.

--OutMatrixFormat RowsAndColumns | IDPairsAndValue
Specify how similarity or distance values calculated for fingerprints vector and bit-vector strings are written to the output CSV/TSV text file(s): generate text files containing rows and columns with their labels corresponding to compound IDs and each matrix element value corresponding to similarity or distance between corresponding compounds; generate text files containing rows containing compoundIDs for two compounds followed by similarity or distance value between these compounds.

Possible values: RowsAndColumns, or IDPairsAndValue. Default value: RowsAndColumns.

Example of RowsAndColumns OutMatrixFormat:

"","Cmpd1","Cmpd2","Cmpd3","Cmpd4","Cmpd5","Cmpd6",... ...
"Cmpd1","1","0.04","0.25","0.13","0.11","0.2",... ...
"Cmpd2","0.04","1","0.06","0.05","0.19","0.07",... ...
"Cmpd3","0.25","0.06","1","0.12","0.22","0.25",... ...
"Cmpd4","0.13","0.05","0.12","1","0.11","0.13",... ...
"Cmpd5","0.11","0.19","0.22","0.11","1","0.17",... ...
"Cmpd6","0.2","0.07","0.25","0.13","0.17","1",... ...

Example of IDPairsAndValue OutMatrixFormat:

"CmpdID1","CmpdID2","Coefficient Value"
"Cmpd1","Cmpd1","1"
"Cmpd1","Cmpd2","0.04"
"Cmpd1","Cmpd3","0.25"
... ... ..
... ... ..
... ... ..
"Cmpd2","Cmpd1","0.04"
"Cmpd2","Cmpd2","1"
"Cmpd2","Cmpd3","0.06"
... ... ..
... ... ..
... ... ..
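The two layouts carry the same information. As an illustration, the following Python sketch converts a small RowsAndColumns matrix, using the first three compounds from the example above, into IDPairsAndValue rows. The function name to_id_pairs is hypothetical and the sketch works on text produced with the default comma delimiter and quoting.

```python
import csv
import io

rows_and_columns = '''"","Cmpd1","Cmpd2","Cmpd3"
"Cmpd1","1","0.04","0.25"
"Cmpd2","0.04","1","0.06"
"Cmpd3","0.25","0.06","1"
'''


def to_id_pairs(text):
    """Convert RowsAndColumns CSV text into IDPairsAndValue rows."""
    reader = csv.reader(io.StringIO(text))
    column_ids = next(reader)[1:]      # first header cell is empty
    pairs = [("CmpdID1", "CmpdID2", "Coefficient Value")]
    for row in reader:
        row_id, values = row[0], row[1:]
        for column_id, value in zip(column_ids, values):
            pairs.append((row_id, column_id, value))
    return pairs


pairs = to_id_pairs(rows_and_columns)
for pair in pairs[:4]:
    print(pair)
```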

-o, --overwrite
Overwrite existing files.

-p, --precision number
Precision of calculated values in the output file. Default: up to 2 decimal places. Valid values: positive integers.

-q, --quote Yes | No
Put quotes around column values in output CSV/TSV text file(s). Possible values: Yes or No. Default value: Yes.

-r, --root RootName
New file name is generated using the root: <Root><BitVectorComparisonMode>.<Ext> or <Root><VectorComparisonMode><VectorComparisonFormulism>.<Ext>. The csv, and tsv <Ext> values are used for comma/semicolon, and tab delimited text files respectively. This option is ignored for multiple input files.
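The naming scheme can be written out as a short Python sketch. It covers only the pattern stated above; the function output_name is hypothetical, and how the script chooses the root when -r, --root is not given is not modeled here.

```python
def output_name(root, comparison_mode, formulism="", out_delim="comma"):
    """Build <Root><ComparisonMode>[<Formulism>].<Ext> as described above."""
    # csv is used for comma and semicolon delimiters, tsv for tab.
    ext = "tsv" if out_delim == "tab" else "csv"
    return f"{root}{comparison_mode}{formulism}.{ext}"


print(output_name("SampleFPHex", "TanimotoSimilarity"))
print(output_name("SampleFPCount", "TanimotoSimilarity", "AlgebraicForm"))
print(output_name("SampleFPHex", "TanimotoSimilarity", out_delim="tab"))
```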

-v, --VectorComparisonMode All | "TanimotoSimilarity,[ManhattanDistance,...]"
Specify what similarity or distance coefficients to use for calculating similarity matrices for fingerprint vector strings data values in SDFile(s): calculate similarity matrices for all supported similarity and distance coefficients or specify a comma delimited list of similarity and distance coefficients. Possible values: All | "TanimotoSimilarity,[ManhattanDistance,...]". Default: TanimotoSimilarity.

The value of -v, --VectorComparisonMode, in conjunction with --VectorComparisonFormulism, decides which type of similarity and distance coefficient formulism gets used.

All uses complete list of supported similarity and distance coefficients: CosineSimilarity, CzekanowskiSimilarity, DiceSimilarity, OchiaiSimilarity, JaccardSimilarity, SorensonSimilarity, TanimotoSimilarity, CityBlockDistance, EuclideanDistance, HammingDistance, ManhattanDistance, SoergelDistance. These similarity and distance coefficients are described below.

FingerprintsVector.pm module, used to calculate similarity and distance coefficients, provides support to perform comparison between vectors containing three different types of values:

Type I: OrderedNumericalValues

. Size of two vectors are same
. Vectors contain real values in a specific order. For example: MACCS keys count, Topological pharmacophore atom pairs and so on.

Type II: UnorderedNumericalValues

. Size of two vectors might not be same
. Vectors contain unordered real values identified by value IDs. For example: Topological atom pairs, Topological atom torsions and so on.

Type III: AlphaNumericalValues

. Size of two vectors might not be same
. Vectors contain unordered alphanumerical values. For example: Extended connectivity fingerprints, atom neighborhood fingerprints.

Before performing similarity or distance calculations between vectors containing UnorderedNumericalValues or AlphaNumericalValues, the vectors are transformed into vectors containing unique OrderedNumericalValues using value IDs for UnorderedNumericalValues and the values themselves for AlphaNumericalValues.
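One way to picture this transformation is the Python sketch below. It is not the FingerprintsVector.pm code: it assumes each vector is held as a mapping from value ID to value, pads missing IDs with 0, and uses sorted ID order, which is an arbitrary choice since any order shared by both vectors would serve. The function name and the sample IDs and counts are hypothetical.

```python
from collections import Counter


def align(a: dict, b: dict):
    """Map two ID->value dicts onto ordered vectors over the union of IDs."""
    ids = sorted(set(a) | set(b))
    xa = [a.get(i, 0) for i in ids]
    xb = [b.get(i, 0) for i in ids]
    return ids, xa, xb


# UnorderedNumericalValues: value IDs with counts (made-up data)
ids, xa, xb = align({"C:C": 8, "C:O": 2, "C": 8},
                    {"C": 6, "C:N": 1, "C:C": 5})
print(ids, xa, xb)

# AlphaNumericalValues: the values themselves act as IDs (made-up data)
alpha_ids, ya, yb = align(Counter(["a1", "b2", "c3"]),
                          Counter(["b2", "d4"]))
print(alpha_ids, ya, yb)
```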

Three forms of similarity and distance calculation between two vectors, specified using --VectorComparisonFormulism option, are supported: AlgebraicForm, BinaryForm or SetTheoreticForm.

For BinaryForm, the ordered list of processed final vector values containing the value or count of each unique value type is simply converted into a binary vector containing 1s and 0s corresponding to presence or absence of values before calculating similarity or distance between two vectors.
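The BinaryForm conversion amounts to replacing every nonzero entry by 1. A minimal Python sketch, with a hypothetical function name and made-up count values:

```python
def to_binary(values):
    """Replace each value by 1 if it is present (nonzero), else 0."""
    return [1 if v else 0 for v in values]


counts = [0, 2, 0, 1, 7]
print(to_binary(counts))
```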

For two fingerprint vectors A and B of same size containing OrderedNumericalValues, let:

N = Number of values in A or B
Xa = Values of vector A
Xb = Values of vector B
Xai = Value of ith element in A
Xbi = Value of ith element in B
SUM = Sum of i over N values

For SetTheoreticForm of calculation between two vectors, let:

SetIntersectionXaXb = SUM ( MIN ( Xai, Xbi ) )
SetDifferenceXaXb = SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ( Xai, Xbi ) )

For BinaryForm of calculation between two vectors, let:

Na = Number of bits set to "1" in A = SUM ( Xai )
Nb = Number of bits set to "1" in B = SUM ( Xbi )
Nc = Number of bits set to "1" in both A and B = SUM ( Xai * Xbi )
Nd = Number of bits set to "0" in both A and B = SUM ( 1 - Xai - Xbi + Xai * Xbi)
N = Number of bits set to "1" or "0" in A or B = Size of A or B = Na + Nb - Nc + Nd

Additionally, for BinaryForm various values also correspond to:

Na = | Xa |
Nb = | Xb |
Nc = | SetIntersectionXaXb |
Nd = N - | SetDifferenceXaXb |
| SetDifferenceXaXb | = N - Nd = Na + Nb - Nc + Nd - Nd = Na + Nb - Nc = | Xa | + | Xb | - | SetIntersectionXaXb |
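The correspondence between the BinaryForm counts and the SetTheoreticForm sums can be checked numerically. The Python sketch below uses two made-up binary vectors; the variable names mirror the symbols defined above.

```python
xa = [1, 1, 0, 1, 0, 0, 1, 0]
xb = [1, 0, 0, 1, 1, 0, 0, 0]
n = len(xa)

# BinaryForm counts written as sums over the vector elements
na = sum(xa)
nb = sum(xb)
nc = sum(a * b for a, b in zip(xa, xb))
nd = sum(1 - a - b + a * b for a, b in zip(xa, xb))

# SetTheoreticForm quantities
set_intersection = sum(min(a, b) for a, b in zip(xa, xb))
set_difference = sum(xa) + sum(xb) - set_intersection

print(na, nb, nc, nd)
print(set_intersection, set_difference)
```

For binary vectors SetIntersectionXaXb equals Nc and SetDifferenceXaXb equals N - Nd, which is 5 in this example.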

Various distance and similarity coefficients [ Ref 40, Ref 62, Ref 64 ] for a pair of vectors A and B in AlgebraicForm, BinaryForm and SetTheoreticForm are defined as follows:

CityBlockDistance: ( same as HammingDistance and ManhattanDistance)

AlgebraicForm: SUM ( ABS ( Xai - Xbi ) )

BinaryForm: ( Na - Nc ) + ( Nb - Nc ) = Na + Nb - 2 * Nc

SetTheoreticForm: | SetDifferenceXaXb | - | SetIntersectionXaXb | = SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) )

CosineSimilarity: ( same as OchiaiSimilarity)

AlgebraicForm: SUM ( Xai * Xbi ) / SQRT ( SUM ( Xai ** 2) * SUM ( Xbi ** 2) )

BinaryForm: Nc / SQRT ( Na * Nb)

SetTheoreticForm: | SetIntersectionXaXb | / SQRT ( |Xa| * |Xb| ) = SUM ( MIN ( Xai, Xbi ) ) / SQRT ( SUM ( Xai ) * SUM ( Xbi ) )

CzekanowskiSimilarity: ( same as DiceSimilarity and SorensonSimilarity)

AlgebraicForm: ( 2 * ( SUM ( Xai * Xbi ) ) ) / ( SUM ( Xai ** 2) + SUM ( Xbi **2 ) )

BinaryForm: 2 * Nc / ( Na + Nb )

SetTheoreticForm: 2 * | SetIntersectionXaXb | / ( |Xa| + |Xb| ) = 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) )

DiceSimilarity: ( same as CzekanowskiSimilarity and SorensonSimilarity)

AlgebraicForm: ( 2 * ( SUM ( Xai * Xbi ) ) ) / ( SUM ( Xai ** 2) + SUM ( Xbi **2 ) )

BinaryForm: 2 * Nc / ( Na + Nb )

SetTheoreticForm: 2 * | SetIntersectionXaXb | / ( |Xa| + |Xb| ) = 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) )

EuclideanDistance:

AlgebraicForm: SQRT ( SUM ( ( ( Xai - Xbi ) ** 2 ) ) )

BinaryForm: SQRT ( ( Na - Nc ) + ( Nb - Nc ) ) = SQRT ( Na + Nb - 2 * Nc )

SetTheoreticForm: SQRT ( | SetDifferenceXaXb | - | SetIntersectionXaXb | ) = SQRT ( SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) )

HammingDistance: ( same as CityBlockDistance and ManhattanDistance)

AlgebraicForm: SUM ( ABS ( Xai - Xbi ) )

BinaryForm: ( Na - Nc ) + ( Nb - Nc ) = Na + Nb - 2 * Nc

SetTheoreticForm: | SetDifferenceXaXb | - | SetIntersectionXaXb | = SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) )

JaccardSimilarity: ( same as TanimotoSimilarity)

AlgebraicForm: SUM ( Xai * Xbi ) / ( SUM ( Xai ** 2 ) + SUM ( Xbi ** 2 ) - SUM ( Xai * Xbi ) )

BinaryForm: Nc / ( ( Na - Nc ) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc )

SetTheoreticForm: | SetIntersectionXaXb | / | SetDifferenceXaXb | = SUM ( MIN ( Xai, Xbi ) ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ( Xai, Xbi ) ) )

ManhattanDistance: ( same as CityBlockDistance and HammingDistance)

AlgebraicForm: SUM ( ABS ( Xai - Xbi ) )

BinaryForm: ( Na - Nc ) + ( Nb - Nc ) = Na + Nb - 2 * Nc

SetTheoreticForm: | SetDifferenceXaXb | - | SetIntersectionXaXb | = SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) )

OchiaiSimilarity: ( same as CosineSimilarity)

AlgebraicForm: SUM ( Xai * Xbi ) / SQRT ( SUM ( Xai ** 2) * SUM ( Xbi ** 2) )

BinaryForm: Nc / SQRT ( Na * Nb)

SetTheoreticForm: | SetIntersectionXaXb | / SQRT ( |Xa| * |Xb| ) = SUM ( MIN ( Xai, Xbi ) ) / SQRT ( SUM ( Xai ) * SUM ( Xbi ) )

SorensonSimilarity: ( same as CzekanowskiSimilarity and DiceSimilarity)

AlgebraicForm: ( 2 * ( SUM ( Xai * Xbi ) ) ) / ( SUM ( Xai ** 2) + SUM ( Xbi **2 ) )

BinaryForm: 2 * Nc / ( Na + Nb )

SetTheoreticForm: 2 * | SetIntersectionXaXb | / ( |Xa| + |Xb| ) = 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) )

SoergelDistance:

AlgebraicForm: SUM ( ABS ( Xai - Xbi ) ) / SUM ( MAX ( Xai, Xbi ) )

BinaryForm: 1 - Nc / ( Na + Nb - Nc ) = ( Na + Nb - 2 * Nc ) / ( Na + Nb - Nc )

SetTheoreticForm: ( | SetDifferenceXaXb | - | SetIntersectionXaXb | ) / | SetDifferenceXaXb | = ( SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ( Xai, Xbi ) ) )

TanimotoSimilarity: ( same as JaccardSimilarity)

AlgebraicForm: SUM ( Xai * Xbi ) / ( SUM ( Xai ** 2 ) + SUM ( Xbi ** 2 ) - SUM ( Xai * Xbi ) )

BinaryForm: Nc / ( ( Na - Nc ) + ( Nb - Nc ) + Nc ) = Nc / ( Na + Nb - Nc )

SetTheoreticForm: | SetIntersectionXaXb | / | SetDifferenceXaXb | = SUM ( MIN ( Xai, Xbi ) ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ( Xai, Xbi ) ) )
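To see how the three formulisms differ on count data, the Python sketch below evaluates TanimotoSimilarity, CityBlockDistance and SoergelDistance for one pair of made-up ordered vectors. It is an illustration of the formulas above rather than the FingerprintsVector.pm implementation; the function names are hypothetical and zero denominators are not handled.

```python
def algebraic(xa, xb):
    dot = sum(a * b for a, b in zip(xa, xb))
    city_block = sum(abs(a - b) for a, b in zip(xa, xb))
    return {
        "TanimotoSimilarity":
            dot / (sum(a * a for a in xa) + sum(b * b for b in xb) - dot),
        "CityBlockDistance": city_block,
        "SoergelDistance": city_block / sum(max(a, b) for a, b in zip(xa, xb)),
    }


def set_theoretic(xa, xb):
    intersection = sum(min(a, b) for a, b in zip(xa, xb))
    difference = sum(xa) + sum(xb) - intersection
    return {
        "TanimotoSimilarity": intersection / difference,
        "CityBlockDistance": difference - intersection,
        "SoergelDistance": (difference - intersection) / difference,
    }


def binary(xa, xb):
    # Convert to presence/absence before counting
    ba = [1 if a else 0 for a in xa]
    bb = [1 if b else 0 for b in xb]
    na, nb = sum(ba), sum(bb)
    nc = sum(a * b for a, b in zip(ba, bb))
    return {
        "TanimotoSimilarity": nc / (na + nb - nc),
        "CityBlockDistance": na + nb - 2 * nc,
        "SoergelDistance": (na + nb - 2 * nc) / (na + nb - nc),
    }


xa, xb = [2, 0, 1, 3], [1, 1, 0, 3]
for name, form in (("AlgebraicForm", algebraic),
                   ("BinaryForm", binary),
                   ("SetTheoreticForm", set_theoretic)):
    print(name, form(xa, xb))
```

For this pair the Tanimoto values are 11/14, 1/2 and 4/7 respectively, while the CityBlock distance is 3 in AlgebraicForm and SetTheoreticForm but 2 in BinaryForm.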

--VectorComparisonFormulism All | "AlgebraicForm,[BinaryForm,SetTheoreticForm]"
Specify fingerprints vector comparison formulism to use for calculating similarity and distance coefficients during -v, --VectorComparisonMode: use all supported comparison formulisms or specify a comma delimited list of formulisms. Possible values: All | "AlgebraicForm,[BinaryForm,SetTheoreticForm]". Default value: AlgebraicForm.

All uses all three forms of supported vector comparison formulism for values of -v, --VectorComparisonMode option.

For fingerprint vector strings containing AlphaNumericalValues data values - ExtendedConnectivityFingerprints, AtomNeighborhoodsFingerprints and so on - all three formulisms result in the same value during similarity and distance calculations.

-w, --WorkingDir DirName
Location of working directory. Default: current directory.

EXAMPLES

To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create a SampleFPHexTanimotoSimilarity.csv file containing sequentially generated compound IDs with Cmpd prefix, type:

% SimilarityMatrixSDFiles.pl -o SampleFPHex.sdf

To generate a similarity matrix corresponding to Tanimoto similarity coefficient using algebraic formulism for fingerprints vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create a SampleFPCountTanimotoSimilarityAlgebraicForm.csv file containing sequentially generated compound IDs with Cmpd prefix, type:

% SimilarityMatrixSDFiles.pl -o SampleFPCount.sdf

To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create a SampleFPHexTanimotoSimilarity.csv file in IDPairsAndValue format containing sequentially generated compound IDs with Cmpd prefix, type:

% SimilarityMatrixSDFiles.pl --OutMatrixFormat IDPairsAndValue -o SampleFPHex.sdf

To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create a SampleFPHexTanimotoSimilarity.csv file containing compound IDs from mol name line, type:

% SimilarityMatrixSDFiles.pl --CompoundIDMode MolName -o SampleFPHex.sdf

To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create a SampleFPBinTanimotoSimilarity.csv file containing compound IDs from data field name Mol_ID, type:

% SimilarityMatrixSDFiles.pl --CompoundIDMode DataField --CompoundID Mol_ID -o SampleFPBin.sdf

To generate similarity matrices corresponding to Buser, Dice and Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create SampleFPBin[CoefficientName]Similarity.csv files containing sequentially generated compound IDs with Cmpd prefix, type:

% SimilarityMatrixSDFiles.pl -b "BuserSimilarity,DiceSimilarity, TanimotoSimilarity" -o SampleFPBin.sdf

To generate similarity matrices corresponding to CityBlock distance and Tanimoto similarity coefficients using algebraic formulism for fingerprints vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create SampleFPCount[CoefficientName]AlgebraicForm.csv files containing sequentially generated compound IDs with Cmpd prefix, type:

% SimilarityMatrixSDFiles.pl -v "CityBlockDistance,TanimotoSimilarity" -o SampleFPCount.sdf

To generate similarity matrices corresponding to CityBlock distance and Tanimoto similarity coefficients using binary formulism for fingerprints vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create SampleFPCount[CoefficientName]BinaryForm.csv files containing sequentially generated compound IDs with Cmpd prefix, type:

% SimilarityMatrixSDFiles.pl -v "CityBlockDistance,TanimotoSimilarity" --VectorComparisonFormulism BinaryForm -o SampleFPCount.sdf

To generate similarity matrices corresponding to CityBlock distance and Tanimoto similarity coefficients using all supported comparison formulisms for fingerprints vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create SampleFPCount[CoefficientName][FormulismName].csv files containing sequentially generated compound IDs with Cmpd prefix, type:

% SimilarityMatrixSDFiles.pl -v "CityBlockDistance,TanimotoSimilarity" --VectorComparisonFormulism All -o SampleFPCount.sdf

To generate similarity matrices corresponding to all available similarity coefficients for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create SampleFPHex[CoefficientName].csv files containing sequentially generated compound IDs with Cmpd prefix, type:

% SimilarityMatrixSDFiles.pl -m AutoDetect --BitVectorComparisonMode All --alpha 0.5 --beta 0.5 -o SampleFPHex.sdf

To generate similarity matrices corresponding to all available similarity and distance coefficients using all comparison formulisms for fingerprints vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create SampleFPCount[CoefficientName][FormulismName].csv files containing sequentially generated compound IDs with Cmpd prefix, type:

% SimilarityMatrixSDFiles.pl -m AutoDetect --VectorComparisonMode All --VectorComparisonFormulism All -o SampleFPCount.sdf

To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field named Fingerprints and create a SampleFPHexTanimotoSimilarity.csv file containing compound IDs present in a data field named Mol_ID, type:

% SimilarityMatrixSDFiles.pl --FingerprintsField Fingerprints --CompoundIDMode DataField --CompoundID Mol_ID -o SampleFPHex.sdf

To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create a SampleFPHexTanimotoSimilarity.csv file containing compound IDs from molname line or sequentially generated compound IDs with Mol prefix, type:

% SimilarityMatrixSDFiles.pl --CompoundIDMode MolNameOrLabelPrefix --CompoundID Mol -o SampleFPHex.sdf

To generate a similarity matrix corresponding to Tanimoto similarity coefficient for fingerprints bit-vector strings data corresponding to supported fingerprints present in a data field with Fingerprint substring in its label and create a SampleFPHexTanimotoSimilarity.tsv file containing sequentially generated compound IDs with Cmpd prefix, type:

% SimilarityMatrixSDFiles.pl --OutDelim tab --quote No -o SampleFPHex.sdf

AUTHOR

Manish Sud

SEE ALSO

InfoFingerprintsTextFiles.pl, InfoFingerprintsSDFiles.pl, SimilarityMatrixTextFiles.pl, AtomNeighborhoodsFingerprints.pl, ExtendedConnectivityFingerprints.pl, MACCSKeysFingerprints.pl, PathLengthFingerprints.pl, TopologicalAtomPairsFingerprints.pl, TopologicalAtomTorsionsFingerprints.pl, TopologicalPharmacophoreAtomPairsFingerprints.pl, TopologicalPharmacophoreAtomTripletsFingerprints.pl

COPYRIGHT

Copyright (C) 2004-2010 Manish Sud. All rights reserved.

This file is part of MayaChemTools.

MayaChemTools is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

 

 

July 5, 2010