ExtractFromSDFiles.pl - Extract specific data from SDFile(s)
ExtractFromSDFiles.pl SDFile(s)...
ExtractFromSDFiles.pl [-h, --help] [-d, --datafields "fieldlabel,..." | "fieldlabel,value,criteria..." | "fieldlabel,value,value..."] [--datafieldsfile filename] [--indelim comma | tab | semicolon] [-m, --mode alldatafields | commondatafields | datafields | datafieldsbyvalue | datafieldsbyregex | datafieldbylist | datafielduniquebylist | molnames | randomcmpds | recordnum | recordrange | 2dcmpdrecords | 3dcmpdrecords ] [-n, --numofcmpds number] [--outdelim comma | tab | semicolon] [--output SD | text | both] [-o, --overwrite] [-q, --quote yes | no] [--record recnum | startrecnum,endrecnum] [--RegexIgnoreCase yes | no] [-r, --root rootname] [-s, --seed number] [--StrDataString yes | no] [--StrDataStringDelimiter text] [--StrDataStringMode StrOnly | StrAndDataFields] [--ValueComparisonMode Numeric | Alphanumeric] [-v, --violations number] [-w, --workingdir dirname] SDFile(s)...
Extract specific data from SDFile(s) and generate appropriate SD or CSV/TSV text file(s). The structure data from SDFile(s) is not transferred to CSV/TSV text file(s). Multiple SDFile names are separated by spaces. The valid file extensions are .sdf and .sd. All other file names are ignored. All the SD files in the current directory can be specified either by *.sdf or by the current directory name.
For datafields mode, input value format is: fieldlabel,.... Examples:
For datafieldsbyvalue mode, input value format contains these triplets: fieldlabel,value,criteria.... Possible values for criteria: le, ge or eq. The value of --ValueComparisonMode indicates whether values are compared using numerical or string comparison operators. The default is to treat data field values as numerical values and use numerical comparison operators. Examples:
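The triplet format and the two comparison modes can be illustrated with a minimal Python sketch. This is not the script's own code: the function names, the dictionary representation of a record, and the handling of a missing field are assumptions made for illustration, and the field names are borrowed from the criteria example later in this document.

```python
# Illustrative sketch only; not the script's actual implementation.
def parse_triplets(spec):
    """Split 'fieldlabel,value,criteria,...' into (label, value, criteria) triplets."""
    parts = [p.strip() for p in spec.split(",")]
    if len(parts) % 3 != 0:
        raise ValueError("expected fieldlabel,value,criteria triplets")
    return [tuple(parts[i:i + 3]) for i in range(0, len(parts), 3)]


def matches(record, triplets, mode="numeric"):
    """Return True when every triplet holds for record (a dict of field -> string)."""
    for label, value, criteria in triplets:
        if label not in record:
            return False  # assumption: a missing field fails the criterion
        left, right = record[label], value
        if mode == "numeric":
            left, right = float(left), float(right)
        # in any other mode the raw strings are compared lexically
        if criteria == "le":
            ok = left <= right
        elif criteria == "ge":
            ok = left >= right
        elif criteria == "eq":
            ok = left == right
        else:
            raise ValueError(f"unknown criteria: {criteria}")
        if not ok:
            return False
    return True


triplets = parse_triplets("MolWt,450,le,LogP,5,le")
keep = matches({"MolWt": "320.4", "LogP": "2.1"}, triplets)
```

The choice of mode matters: the value 90 satisfies "MolWt,450,le" numerically, but as a string "90" sorts after "450", so the same criterion fails under string comparison.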
For datafieldsbyregex mode, input value format contains these triplets: fieldlabel,regex,criteria.... regex corresponds to any valid regular expression and is used to match the values for the specified fieldlabel. Possible values for criteria: eq or ne. For eq and ne, the data field value is matched against the regular expression using =~ and !~ respectively. The --RegexIgnoreCase option value determines whether letter case is ignored during the regular expression match. Examples:
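A rough Python analogue of the eq/ne matching is sketched below. Perl's =~ performs an unanchored match, so re.search is used here. The function name, the tuple representation of the triplets, and treating a missing field as an empty string are assumptions for illustration, not the script's behavior.

```python
import re


# Illustrative sketch only; not the script's actual implementation.
def regex_matches(record, triplets, ignore_case):
    """Return True when every (label, regex, criteria) triplet holds for record."""
    flags = re.IGNORECASE if ignore_case else 0
    for label, pattern, criteria in triplets:
        if criteria not in ("eq", "ne"):
            raise ValueError(f"unknown criteria: {criteria}")
        # assumption: a missing field is treated as an empty string
        found = re.search(pattern, record.get(label, ""), flags) is not None
        if criteria == "eq" and not found:
            return False
        if criteria == "ne" and found:
            return False
    return True


record = {"CompoundName": "Aspirin"}
hit = regex_matches(record, [("CompoundName", "^asp", "eq")], ignore_case=True)
```

With ignore_case set to False the same pattern no longer matches "Aspirin", which is the effect the --RegexIgnoreCase option controls.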
For datafieldbylist and datafielduniquebylist mode, input value format is: fieldlabel,value1,value2.... This is equivalent to datafieldsbyvalue mode with this input value format: fieldlabel,value1,eq,fieldlabel,value2,eq,.... For datafielduniquebylist mode, only unique compounds identified by the first occurrence of the value associated with fieldlabel in SDFile(s) are kept; any subsequent compounds are simply ignored.
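The difference between the two list modes can be sketched in a few lines of Python. The function name, the unique flag, and the list-of-dicts record representation are hypothetical; the sketch reads the value list as a membership test, which is an interpretation of the description above rather than the script's code.

```python
# Illustrative sketch only; not the script's actual implementation.
def filter_by_list(records, label, values, unique=False):
    """Keep records whose field value is in values; optionally first occurrence only."""
    wanted = set(values)
    seen = set()
    kept = []
    for record in records:
        value = record.get(label)
        if value not in wanted:
            continue
        if unique and value in seen:
            continue  # a later compound with an already-seen value is ignored
        seen.add(value)
        kept.append(record)
    return kept


records = [
    {"MolID": "A", "n": 1},
    {"MolID": "B", "n": 2},
    {"MolID": "A", "n": 3},
    {"MolID": "C", "n": 4},
]
all_hits = filter_by_list(records, "MolID", ["A", "C"])
first_hits = filter_by_list(records, "MolID", ["A", "C"], unique=True)
```

Here all_hits keeps both records labeled A, while first_hits keeps only the first of them.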
For datafields mode, input file lines contain comma delimited field labels: fieldlabel,.... Example:
For datafieldsbyvalue mode, input file lines contain these comma separated triplets: fieldlabel,value,criteria. Possible values for criteria: le, ge or eq. Examples:
For datafieldbylist and datafielduniquebylist mode, input file line format is:
For datafielduniquebylist mode, only unique compounds identified by first occurrence of value associated with fieldlabel in SDFile(s) are kept; any subsequent compounds are simply ignored. Example:
For alldatafields and molnames mode, only a CSV/TSV text file is generated; for all other modes, however, an SD file is generated by default. You can change this behavior to generate a text file using the --output option.
For 3DCmpdRecords mode, only those compounds with at least one non-zero value for Z atomic coordinates are retrieved; however, during retrieval of compounds in 2DCmpdRecords mode, all Z atomic coordinates must be zero.
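The Z coordinate test can be sketched in Python as follows, assuming a V2000 molfile block in which the fourth line is the counts line and each atom line stores x, y and z in three 10-character columns. The function names are hypothetical and the sketch is not the script's own code.

```python
# Illustrative sketch only; assumes a V2000 molfile block.
def z_coordinates(molblock):
    """Return the Z coordinate of every atom in a V2000 molfile block."""
    lines = molblock.splitlines()
    atom_count = int(lines[3][0:3])  # counts line: first 3 columns hold the atom count
    return [float(line[20:30]) for line in lines[4:4 + atom_count]]


def is_3d(molblock):
    """True when at least one atom has a non-zero Z coordinate."""
    return any(z != 0.0 for z in z_coordinates(molblock))


def is_2d(molblock):
    """True when every atom has a zero Z coordinate."""
    return all(z == 0.0 for z in z_coordinates(molblock))
```

Under these definitions the two tests are complementary for any block with at least one atom.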
The value of StrDataStringDelimiter option is used as a delimiter to join structure data lines into a structure data string.
This option is ignored during generation of SD file(s).
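Joining structure data lines with a delimiter can be sketched in Python as below. The function name is hypothetical, and the assumption that the structure block ends at the "M  END" line in StrOnly mode is made for illustration; it is not taken from the script.

```python
# Illustrative sketch only; not the script's actual implementation.
def structure_data_string(record_lines, delimiter, mode="StrOnly"):
    """Join the lines of one SD record into a single delimited string."""
    lines = [line.rstrip("\r\n") for line in record_lines]
    if mode == "StrOnly":
        # assumption: the structure block ends at the "M  END" line
        for i, line in enumerate(lines):
            if line.startswith("M  END"):
                lines = lines[: i + 1]
                break
    return delimiter.join(lines)
```

A delimiter such as | keeps the whole structure in one text column, so that splitting the column value on the delimiter recovers the original lines.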
To retrieve all data fields from SD files and generate CSV text files, type:
To retrieve all data fields from an SD file and generate CSV text files containing a column with structure data as a string with | as line delimiter, type:
To retrieve the MOL_ID data field from an SD file and generate CSV text files containing a column with structure data along with all data fields as a string with | as line delimiter, type:
To retrieve common data fields which exist for all the compounds in an SD file and generate a TSV text file NewSample.tsv, type:
To retrieve MolId, ExtReg, and CompoundName data fields from an SD file and generate a CSV text file NewSample.csv, type:
To retrieve compounds which meet a specific set of criteria (MolWt <= 450, LogP <= 5 and SumNO < 10) from an SD file and generate a new SD file NewSample.sdf, type:
To retrieve compounds from an SD file with a specific set of values for MolID and generate a new SD file NewSample.sdf, type:
To retrieve 10 random compounds from an SD file and generate a new SD file RandomSample.sdf, type:
To retrieve compound record number 10 from an SD file and generate a new SD file NewSample.sdf, type:
To retrieve compound records 10 through 20 from an SD file and generate a new SD file NewSample.sdf, type:
FilterSDFiles.pl, InfoSDFiles.pl, SplitSDFiles.pl, MergeTextFilesWithSD.pl
Copyright (C) 2004-2012 Manish Sud. All rights reserved.
This file is part of MayaChemTools.
MayaChemTools is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.