AnalyzeSDFilesData.pl - Analyze numerical data field values in SDFile(s)
AnalyzeSDFilesData.pl SDFile(s)...
AnalyzeSDFilesData.pl [--datafields "fieldlabel,[fieldlabel,...]" | All] [--datafieldpairs "fieldlabel,fieldlabel,[fieldlabel,fieldlabel,...]" | AllPairs] [-d, --detail infolevel] [-f, --fast] [--frequencybins number | "number,number,[number,...]"] [-h, --help] [--klargest number] [--ksmallest number] [-m, --mode DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All | "function1, [function2,...]"] [--trimfraction number] [-w, --workingdir dirname] SDFile(s)...
Analyze numerical data field values in SDFile(s) using a combination of various statistical functions; non-numerical values are simply ignored. For Correlation, RSquare, and Covariance analysis, the count of valid values in the specified data field pairs must be the same; otherwise, the data field pair is ignored. The file names are separated by spaces. The valid file extensions are .sdf and .sd. All other file names are ignored. All the SD files in the current directory can be specified either by *.sdf or by the current directory name.
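The handling of non-numerical values and mismatched data field pairs described above can be sketched as follows. This is a minimal Python illustration, not the script's own Perl code: the function names numeric_values and pair_statistics are hypothetical, the rule for what counts as a number (anything Python's float() accepts and that is finite) is an assumption, and the Covariance, PearsonCorrelation, and RSquare formulas are the ones listed later in this document.

```python
import math


def numeric_values(raw_values):
    """Keep entries that parse as finite numbers; ignore everything else.

    Assumption: "numerical" means float() accepts the text and the result
    is finite. The actual script may apply a different test.
    """
    kept = []
    for raw in raw_values:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue  # non-numerical value: simply ignored
        if math.isfinite(value):
            kept.append(value)
    return kept


def pair_statistics(raw_x, raw_y):
    """Return (covariance, correlation, rsquare), or None if the pair is ignored."""
    x = numeric_values(raw_x)
    y = numeric_values(raw_y)
    # The pair is ignored unless both fields have the same count of valid values.
    if len(x) != len(y) or not x:
        return None
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    covariance = sxy / n
    if sxx == 0 or syy == 0:
        # Correlation is undefined when either field is constant.
        return covariance, None, None
    correlation = sxy / math.sqrt(sxx * syy)
    return covariance, correlation, correlation ** 2
```

For instance, pair_statistics(["1", "2", "3", "N/A"], ["2", "4", "6"]) drops the "N/A" entry and then treats the two fields as a valid pair of three values each, while a pair with two valid values in one field and three in the other returns None.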
For AllPairs value of --datafieldpairs option, all data field label pairs are used for Correlation and Covariance calculations.
The number of bins, together with the smallest and largest values for a column, is used to group the column values into different groups.
The bin range list is used to group values for a column into different groups; it must contain values in ascending order. Examples:
The frequency value calculated for a specific bin corresponds to all the column values which are greater than the previous bin value and less than or equal to the current bin value.
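The two ways of specifying frequency bins can be illustrated with a short Python sketch. It is not taken from the script: the function names are hypothetical, equal-width bins between the smallest and largest value are an assumption for the number-of-bins case, and values above the last bin value are assumed to be left uncounted. The counting rule itself (greater than the previous bin value, less than or equal to the current one) follows the description above.

```python
from bisect import bisect_left


def bin_upper_bounds(values, bin_count):
    """Derive bin values from a bin count and the column's min and max.

    Assumption: bins are of equal width; the script may choose differently.
    """
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    bounds = [low + width * (i + 1) for i in range(bin_count - 1)]
    bounds.append(high)  # use the exact maximum to avoid float round-off
    return bounds


def frequency(values, upper_bounds):
    """Count values per bin; upper_bounds must be in ascending order."""
    counts = [0] * len(upper_bounds)
    for value in values:
        # Index of the first bin value >= value, i.e.
        # previous bin value < value <= current bin value.
        index = bisect_left(upper_bounds, value)
        if index < len(counts):
            counts[index] += 1  # values above the last bin value are skipped
    return counts
```

With the bin range list [10, 20, 30], the values 1, 5, and 10 fall into the first bin, 10.5 and 20 into the second, and 25 into the third, because a value equal to a bin value belongs to that bin rather than the next one.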
DescriptiveStatisticsBasic includes these functions: Count, Maximum, Minimum, Mean, Median, Sum, StandardDeviation, StandardError, Variance.
DescriptiveStatisticsAll, in addition to DescriptiveStatisticsBasic functions, includes: GeometricMean, Frequency, HarmonicMean, KLargest, KSmallest, Kurtosis, Mode, RSquare, Skewness, TrimMean.
All uses the complete list of supported functions: Average, AverageDeviation, Correlation, Count, Covariance, GeometricMean, Frequency, HarmonicMean, KLargest, KSmallest, Kurtosis, Maximum, Minimum, Mean, Median, Mode, RSquare, Skewness, Sum, SumOfSquares, StandardDeviation, StandardDeviationN, StandardError, StandardScores, StandardScoresN, TrimMean, Variance, VarianceN. The function names ending with N calculate the corresponding values assuming an entire population instead of a population sample. Here are the formulas for these functions:
Average: See Mean
AverageDeviation: SUM( ABS(x[i] - Xmean) ) / n
Correlation: See Pearson Correlation
Covariance: SUM( (x[i] - Xmean)(y[i] - Ymean) ) / n
GeometricMean: NthROOT( PRODUCT(x[i]) )
HarmonicMean: 1 / ( SUM(1/x[i]) / n )
Mean: SUM( x[i] ) / n
Median: Xsorted[(n - 1)/2 + 1] for odd values of n; (Xsorted[n/2] + Xsorted[n/2 + 1])/2 for even values of n.
Kurtosis: [ {n(n + 1)/((n - 1)(n - 2)(n - 3))} SUM{ ((x[i] - Xmean)/STDDEV)^4 } ] - {3((n - 1)^2)}/{(n - 2)(n - 3)}
PearsonCorrelation: SUM( (x[i] - Xmean)(y[i] - Ymean) ) / SQRT( SUM( (x[i] - Xmean)^2 ) (SUM( (y[i] - Ymean)^2 )) )
RSquare: PearsonCorrelation^2
Skewness: {n/((n - 1)(n - 2))} SUM{ ((x[i] - Xmean)/STDDEV)^3 }
StandardDeviation: SQRT ( SUM( (x[i] - Mean)^2 ) / (n - 1) )
StandardDeviationN: SQRT ( SUM( (x[i] - Mean)^2 ) / n )
StandardError: StandardDeviation / SQRT( n )
StandardScores: (x[i] - Mean) / StandardDeviation
StandardScoresN: (x[i] - Mean) / StandardDeviationN
Variance: SUM( (x[i] - Xmean)^2 / (n - 1) )
VarianceN: SUM( (x[i] - Xmean)^2 / n )
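As a cross-check of the formulas above, here is a minimal Python sketch of several of the single-column functions. The function names and the population flag are hypothetical and are not part of the script; the flag switches between the sample form (divide by n - 1) and the N form (divide by n). Median uses the usual convention: the middle value for odd n, and the mean of the two middle values for even n.

```python
import math


def mean(x):
    return sum(x) / len(x)


def median(x):
    s = sorted(x)
    n = len(s)
    mid = n // 2  # 0-based index of the upper middle element
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def variance(x, population=False):
    """Variance (sample, n - 1) or VarianceN (population, n)."""
    m = mean(x)
    sum_sq = sum((v - m) ** 2 for v in x)
    return sum_sq / (len(x) if population else len(x) - 1)


def standard_deviation(x, population=False):
    """StandardDeviation or StandardDeviationN."""
    return math.sqrt(variance(x, population))


def standard_error(x):
    return standard_deviation(x) / math.sqrt(len(x))


def skewness(x):
    """Requires n >= 3 and a non-zero sample standard deviation."""
    n, m, sd = len(x), mean(x), standard_deviation(x)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / sd) ** 3 for v in x)


def kurtosis(x):
    """Requires n >= 4 and a non-zero sample standard deviation."""
    n, m, sd = len(x), mean(x), standard_deviation(x)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    fourth = sum(((v - m) / sd) ** 4 for v in x)
    return lead * fourth - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the sum of squared deviations is 32, so VarianceN is 32/8 = 4 and StandardDeviationN is 2, while Variance is 32/7. The difference between the two forms shrinks as n grows.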
To calculate basic statistics for data in all common data fields and generate a NewSample1DescriptiveStatisticsBasic.csv file, type:
To calculate basic statistics for MolWeight data field and generate a NewSample1DescriptiveStatisticsBasic.csv file, type:
To calculate all available statistics for MolWeight data field and all data field pairs, and generate NewSample1DescriptiveStatisticsAll.csv, NewSample1CorrelationMatrix.csv, and NewSample1MolWeightFrequencyAnalysis.csv files, type:
To compute frequency distribution of MolWeight data field into five bins and generate NewSample1MolWeightFrequencyAnalysis.csv, type:
To compute frequency distribution of data in MolWeight data field into specified bin range values, and generate NewSample1MolWeightFrequencyAnalysis.csv, type:
To calculate all available statistics for data in all data fields and pairs, type:
FilterSDFiles.pl, InfoSDFiles.pl, SplitSDFiles.pl, MergeTextFilesWithSD.pl
Copyright (C) 2004-2008 Manish Sud. All rights reserved.
This file is part of MayaChemTools.
MayaChemTools is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.