Python
Listing AA sequence entries for a specific strain
If you study bacteria, you should sometimes need amino acid sequences for your strain including mutants. Following codes are what I wrote for that purpose.
Let’s say we want to check UniProt for S. aureus ParC.
We, firstly, need the taxonomy ID for S. aureus. Since UniProt uses the same taxonomy classes as in NCBI, here we access the site with SOAP.
from suds.client import Client wsdl_file = 'http://www.ncbi.nlm.nih.gov/soap/v2.0/eutils.wsdl' client = Client( wsdl_file ) print client
Suds ( https://fedorahosted.org/suds/ ) version: 0.4.1 (beta) build: R703-20101015
Service ( eUtilsService ) tns="http://www.ncbi.nlm.nih.gov/soap/eutils/"
Prefixes (6)
ns0 = "http://www.ncbi.nlm.nih.gov/soap/eutils/egquery"
ns1 = "http://www.ncbi.nlm.nih.gov/soap/eutils/einfo"
ns2 = "http://www.ncbi.nlm.nih.gov/soap/eutils/elink"
ns3 = "http://www.ncbi.nlm.nih.gov/soap/eutils/epost"
ns4 = "http://www.ncbi.nlm.nih.gov/soap/eutils/esearch"
ns5 = "http://www.ncbi.nlm.nih.gov/soap/eutils/esummary"
Ports (1):
(eUtilsServiceSoap)
Methods (7):
run_eGquery(xs:string term, xs:string tool, xs:string email, )
run_eInfo(xs:string db, xs:string tool, xs:string email, )
run_eLink(xs:string db, xs:string[] id, xs:string reldate, xs:string mindate, xs:string maxdate, xs:string datetype, xs:string term, xs:string dbfrom, xs:string linkname, xs:string WebEnv, xs:string query_key, xs:string cmd, xs:string tool, xs:string email, )
run_ePost(xs:string db, xs:string id, xs:string WebEnv, xs:string tool, xs:string email, )
run_eSearch(xs:string db, xs:string term, xs:string WebEnv, xs:string QueryKey, xs:string usehistory, xs:string tool, xs:string email, xs:string field, xs:string reldate, xs:string mindate, xs:string maxdate, xs:string datetype, xs:string RetStart, xs:string RetMax, xs:string rettype, xs:string sort, )
run_eSpell(xs:string db, xs:string term, xs:string tool, xs:string email, )
run_eSummary(xs:string db, xs:string id, xs:string WebEnv, xs:string query_key, xs:string retstart, xs:string retmax, xs:string tool, xs:string email, )
Types (34):
ns1:DbInfoType
ns1:DbListType
ns5:DocSumType
ns4:ErrorListType
ns1:FieldListType
ns1:FieldType
ns2:FirstCharsType
ns2:IconUrlType
ns2:IdCheckListType
ns2:IdLinkSetType
ns2:IdListType
ns4:IdListType
ns2:IdType
ns2:IdUrlListType
ns2:IdUrlSetType
ns3:InvalidIdListType
ns5:ItemType
ns2:LinkInfoType
ns1:LinkListType
ns2:LinkSetDbHistoryType
ns2:LinkSetDbType
ns2:LinkSetType
ns1:LinkType
ns2:LinkType
ns2:ObjUrlType
ns2:ProviderType
ns0:ResultItemType
ns4:TermSetType
ns4:TranslationSetType
ns4:TranslationStackType
ns4:TranslationType
ns2:UrlType
ns4:WarningListType
ns0:eGQueryResultType
From the output above we are able to know that the service provides the method, run_eSearch(). The meanings of its arguments are listed in http://www.ncbi.nlm.nih.gov/books/NBK43082/.
Specifying the taxonomy DB from NCBI, you can easily get the ID of S.aureus.
reply = client.service.run_eSearch(db='taxonomy', term='Staphylococcus aureus[all names]')
print reply
if (reply.Count>0):
taxoID = reply.IdList[0][0]
(reply){
Count = "1"
RetMax = "1"
RetStart = "0"
IdList =
(IdListType){
Id[] =
"1280",
}
TranslationSet = ""
TranslationStack =
(TranslationStackType){
TermSet =
(TermSetType){
Term = "Staphylococcus aureus[all names]"
Field = "all names"
Count = "1"
Explode = "N"
}
OP = "GROUP"
}
QueryTranslation = "Staphylococcus aureus[all names]"
}
Next, we post a query to UniProt as we did in here.
import urllib
args = {
'query':'gene:ParC taxonomy:%s active:yes reviewed:yes' % taxoID,
'format':'xml',
'offset':0
}
encoded_args = urllib.urlencode( args )
url = "http://www.uniprot.org/uniprot/"
xml_data = urllib.urlopen( url, data = encoded_args )
The returned XML will be analyzed accordingly.
from xml.dom import minidom
dom = minidom.parse( xml_data )
entries = dom.getElementsByTagName("entry")
for entry in entries:
ent_name = entry.getElementsByTagName("name")
print ent_name[0].childNodes[0].data
organism = entry.getElementsByTagName("organism")
sci_name = organism[0].getElementsByTagName("name")
print sci_name[0].childNodes[0].data
PARC_STAA8 Staphylococcus aureus (strain NCTC 8325) PARC_STAAC Staphylococcus aureus (strain COL) PARC_STAAM Staphylococcus aureus (strain Mu50 / ATCC 700699) PARC_STAAN Staphylococcus aureus (strain N315) PARC_STAAR Staphylococcus aureus (strain MRSA252) PARC_STAAS Staphylococcus aureus (strain MSSA476) PARC_STAAU Staphylococcus aureus PARC_STAAW Staphylococcus aureus (strain MW2)
We’ve got eight entries.
Exporting eps files from sdfile
I learned that pybel is able to export svg files. But when I tried to convert the svg files into eps files programatically, I found that it was not so easy. After looking around so many pages, I finally found the way.
import pybel
import rsvg, cairo
mols = list( pybel.readfile("sdf","approved.sdf") )
print "Compound Total:",len(mols)
molnames = []
for mol in mols[100:105]:
base_name = mol.data['DRUGBANK_ID']
mol.title = base_name
# make svg handle
svg_data = mol.write("svg")
h_svg = rsvg.Handle(data=svg_data)
# make cairo context
srf = cairo.PSSurface(base_name + ".eps", h_svg.props.width, h_svg.props.height)
srf.set_eps(True)
ctx = cairo.Context( srf )
# make the rsvg handle draw on the cairo context
h_svg.render_cairo(ctx)
#finish drawing
srf.finish()
If you need png files, you can replace the rendering part with this:
srf.write_to_png(base_name + ".png")
Altanatively, you can create PostScript files with rdkit.
from rdkit.Chem import SDMolSupplier,AllChem, Draw
suppl = SDMolSupplier("approved.sdf")
for i in range(100,105):
mol = suppl[i]
t = AllChem.Compute2DCoords(mol)
base_name = mol.GetProp('DRUGBANK_ID')
Draw.MolToFile(mol, base_name + ".ps", imageType="ps")
This is much simpler. But you have to convert the ps-files with another tool like “ps2eps”, which is not what I meant.
Displaying chemical structures on IPython Notebook
IPython is very useful for research because you can record everything what you were thinking at that time. And this is more important: it’s simply fun to play around with!
I found some people mentioned about how to display chemical structures on IPython Notebook:
http://baoilleach.blogspot.jp/2012_03_01_archive.html
http://code.google.com/p/rdkit/wiki/IPythonIntegration
But what I wanted to do was to display all the structures in a sdfile. And, if possible, I wanted to show them in vector format. So I wrote this:
from IPython.core.display import display_svg, SVG
import pybel
mols = list( pybel.readfile("sdf","approved.sdf") )
print "Approved Drugs TOTAL:",len(mols)
molimgs = []
for mol in mols[100:105]:
mol.removeh()
molimgs.append( SVG(mol.write("svg")) )
display_svg( *molimgs )
Thd sdfile was downloaded from DrugBank(http://www.drugbank.ca/).
Blasting EBI services 2
Once you get to know how the services work in EBI, it’s easy to use them. Let’s retrieve some amino acid sequence data from the database.
from suds.client import Client
wsdfUrl = 'http://www.ebi.ac.uk/ws/services/WSDbfetch?wsdl'
clw2Url = 'http://www.ebi.ac.uk/Tools/services/soap/clustalw2?wsdl'
client = Client(wsdfUrl)
ids ="PARC_BACSU PARC_BORBU PARC_CAUCR PARC_ECO57 PARC_ECOLI"
seqs = client.service.fetchBatch("uniprotkb", ids, format="fasta", style="raw")
print seqs
>sp|Q45066|PARC_BACSU DNA topoisomerase 4 subunit A OS=Bacillus subtilis (strain 168) GN=parC PE=3 SV=2 MSQPELFHDLPLEEVIGDRFGRYSKYIIQDRALPDARDGLKPVQRRILYAMHTDGNTFDK NFRKAAKTVGNVIGNYHPHGDSSVYEAMVRMSQDWKVRNVLIEMHGNNGSIDGDPPAAMR YTEARLSPIASELLRDIDKNTVEFVPNFDDTSKEPVVLPAMFPNLLVNGSTGISAGYATD IPPHHLGEVIDAVIKRIQMPSCSVDELMEVIKGPDFPTGGIIQGVDGIRKAYETGKGKII IRGKAEIETIRGGREQIVITEIPFEVNKANLVKKMDEFRIDKKVEGISEVRDETDRTGLR VVIELKKEADAKGILNFLYKNTDLQITYNFNMVAIHNRRPMLMSLPSILDAYIGHQKEVV TNRSVYELQKAKDRHHIVEGLMKALSILDEVIATIRSSSDKRDAKNNLIAKYEFTEPQAE AIVSLQLYRLTNTDITALKEEAEELGKKIEELESILSNDKKLLKVITNSLKALKKKYADT RRSVIEEKIEEIKINLEVMVASEDVYVTVTKDGYLKRTSQRSFAASNGQDFGMKDTDRML HQFEMNTTDVLLLFTNKGSYIYCPVHQLPDIRWKDMGQHFSNLITIDRDETIVKAIPIKE FDPSAYLLFFTKNGMVKKTELTHYKAQRYSKALVALNLKGEDELIDVHVTNGESQIFMAT HLGYGLWFGEDEVNVVGARAAGVKGINLKEDDFVVSGEILQQSDSIVLFTQRGAVKRMSL SEFEKTSRAKRGVVMLRELKKNPHRVVALFACGLEQRLMAETEKGDRKELQTKELRTNDR YSNGSFFFDEEESGKVTAVWRLHTEQ >sp|O51066|PARC_BORBU DNA topoisomerase 4 subunit A OS=Borrelia burgdorferi (strain ATCC 35210 / B31 / CIP 102532 / DSM 4680) GN=parC PE=3 SV=1 MDIRTVLKDNFLQYSSYVIKDRAIASVVDGFKPVQRRIIHSLFEMHDGNFHKVANVVGNT MKYHPHGDTSIYEALVNIANKDLFIEKQGNFGNLFTGDPASASRYIECRLTPLAFDVLYS KEITIYESSYDGRNNEPLLYPAKIPVILIQGSEGIAVGMAAKILPHNFNEILNAVKSELL GESYDIYPDFPTGGIVDVNEYADGNGKVLVRAKIETIDEKTIVIRELPFGETTESLISSI EKAIRKNYIKVSSINDFTAENVAIELSLPRGVYASEVIEKLYHYTNCQISISVNLLLLSE RYPVVYTIKDLIKFHAAHLQKILKMELELQKSKILEKIFYKTLEQIFIEKKIYKLLETIS KEENILSIILSEVLRHKESFSREVLKEDVENLLKIPIRKISLFDIDKNSKDIKILNKELK SINSNISSIRGYSINFIDLLLAKYSKEHQRKTKISLIKSKNVKEIATKNMKVYLNLAEGF AGTSLFDGEFIGNASYYDKILVFRENSYVLKNIEDKTFIDKKNVCALVYDINNSKEQIFS IIYFNRLDNFYYVKRFKIDKFITDKVYEFLGENDEFVDFSLNPEFVEFSTNKDIVKRIEI DNFMVKSRSSIGKRISSNNLKKVKFK >sp|O54478|PARC_CAUCR DNA topoisomerase 4 subunit A OS=Caulobacter crescentus (strain ATCC 19089 / CB15) GN=parC PE=3 SV=2 MNKPVLPPPGGPDDGDRILDEPLTEALSRRYLAYALSTIGSRALPDVRDGLKPVHRRVLY AMSNMRLNPDAAARKCAKVVGEVMGNFHPHGDASIYDALVRLAQEFSQRIPLVEGQGNFG NIDGDSAAAMRYTECKMTEAAMLLLDGIDEDAVDFRPTYDGQDEEPVVLPSGFPNLLANG SSGIAVGMATSIPPHNAAELIDACQLLLANPDATTADLLEKVPGPDFPTGGVIVESRASL LETYETGRGGVRMRAKWEKEDTGRGTYQIVVTEIPYQVKKSDLVEQLADLIDSKKAALLG DVRDESAEDIRLVLEPKSKNVEPEVLMESLFKLSALESRFPVNINVLDARGTPGVMGIKQ ALMAFLAHRREVLTRRARHRLAKIEARLHILDGLLIAYLNLDEVIRIVRYEDKPKEKLIE TFGLTDIQADAILNTRLRQLAKLEEMEIRREHAELVEERDGILAMLASEAKQWKLVGVGL SEVRAALLKIKHPLDKPRPTGVTGRSVFGEAPQVDADAAIEAMIVREPITIILSERGWIR AAKGKIDDPSELKFKEGDKLGFLVPAETTDKLLIFSSDGRFFTLGCDKLPSARGHGEPVR MMIELDDKVKIIDVFPFKAGRKRILASKGGYGFLMPEEEALANRKAGKQVLNVGNEGAAF CLEAVGDQLAVIGDNGKILIFPLEELPEMPRGKGVKLQAYREGGLRDGLSFNAETGAYWI DTAGRRRDWAEWKEWVGRRAGAGKLVPKGFATNKRFRPK >sp|P0AFI3|PARC_ECO57 DNA topoisomerase 4 subunit A OS=Escherichia coli O157:H7 GN=parC PE=3 SV=1 MSDMAERLALHEFTENAYLNYSMYVIMDRALPFIGDGLKPVQRRIVYAMSELGLNASAKF KKSARTVGDVLGKYHPHGDSACYEAMVLMAQPFSYRYPLVDGQGNWGAPDDPKSFAAMRY TESRLSKYSELLLSELGQGTADWVPNFDGTLQEPKMLPARLPNILLNGTTGIAVGMATDI PPHNLREVAQAAIALIDQPKTTLDQLLDIVQGPDYPTEAEIITSRAEIRKIYENGRGSVR MRAVWKKEDGAVVISALPHQVSGARVLEQIAAQMRNKKLPMVDDLRDESDHENPTRLVIV PRSNRVDMDQVMNHLFATTDLEKSYRINLNMIGLDGRPAVKNLLEILSEWLVFRRDTVRR RLNYRLEKVLKRLHILEGLLVAFLNIDEVIEIIRNEDEPKPALMSRFGLTETQAEAILEL KLRHLAKLEEMKIRGEQSELEKERDQLQGILASERKMNNLLKKELQADAQAYGDDRRSPL QEREEAKAMSEHDMLPSEPVTIVLSQMGWVRSAKGHDIDAPGLNYKAGDSFKAAVKGKSN QPVVFVDSTGRSYAIDPITLPSARGQGEPLTGKLTLPPGATVDHMLMESDDQKLLMASDA GYGFVCTFNDLVARNRAGKALITLPENAHVMPPVVIEDASDMLLAITQAGRMLMFPVSDL PQLSKGKGNKIINIPSAEAARGEDGLAQLYVLPPQSTLTIHVGKRKIKLRPEELQKVTGE RGRRGTLMRGLQRIDRVEIDSPRRASSGDSEE >sp|P0AFI2|PARC_ECOLI DNA topoisomerase 4 subunit A OS=Escherichia coli (strain K12) GN=parC PE=1 SV=1 MSDMAERLALHEFTENAYLNYSMYVIMDRALPFIGDGLKPVQRRIVYAMSELGLNASAKF KKSARTVGDVLGKYHPHGDSACYEAMVLMAQPFSYRYPLVDGQGNWGAPDDPKSFAAMRY TESRLSKYSELLLSELGQGTADWVPNFDGTLQEPKMLPARLPNILLNGTTGIAVGMATDI PPHNLREVAQAAIALIDQPKTTLDQLLDIVQGPDYPTEAEIITSRAEIRKIYENGRGSVR MRAVWKKEDGAVVISALPHQVSGARVLEQIAAQMRNKKLPMVDDLRDESDHENPTRLVIV PRSNRVDMDQVMNHLFATTDLEKSYRINLNMIGLDGRPAVKNLLEILSEWLVFRRDTVRR RLNYRLEKVLKRLHILEGLLVAFLNIDEVIEIIRNEDEPKPALMSRFGLTETQAEAILEL KLRHLAKLEEMKIRGEQSELEKERDQLQGILASERKMNNLLKKELQADAQAYGDDRRSPL QEREEAKAMSEHDMLPSEPVTIVLSQMGWVRSAKGHDIDAPGLNYKAGDSFKAAVKGKSN QPVVFVDSTGRSYAIDPITLPSARGQGEPLTGKLTLPPGATVDHMLMESDDQKLLMASDA GYGFVCTFNDLVARNRAGKALITLPENAHVMPPVVIEDASDMLLAITQAGRMLMFPVSDL PQLSKGKGNKIINIPSAEAARGEDGLAQLYVLPPQSTLTIHVGKRKIKLRPEELQKVTGE RGRRGTLMRGLQRIDRVEIDSPRRASSGDSEE
Align the retrieved sequences, then.
from suds.client import Client
clw2Url = 'http://www.ebi.ac.uk/Tools/services/soap/clustalw2?wsdl'
client = Client(clw2Url)
print client
print client.service.getParameters()
print client.service.getParameterDetails("outorder")
clw2_params = {
"outorder":"input",
"sequence":seqs
}
jobid = client.service.run(email="pachiras@localhost", title="testjob", parameters=clw2_params)
Suds ( https://fedorahosted.org/suds/ ) version: 0.4.1 (beta) build: R703-20101015
Service ( JDispatcherService ) tns="http://soap.jdispatcher.ebi.ac.uk"
Prefixes (1)
ns0 = "http://soap.jdispatcher.ebi.ac.uk"
Ports (1):
(JDispatcherServiceHttpPort)
Methods (6):
getParameterDetails(xs:string parameterId, )
getParameters()
getResult(xs:string jobId, xs:string type, wsRawOutputParameters parameters, )
getResultTypes(xs:string jobId, )
getStatus(xs:string jobId, )
run(xs:string email, xs:string title, InputParameters parameters, )
Types (12):
ArrayOfString
InputParameters
wsParameterDetails
wsParameterValue
wsParameterValues
wsParameters
wsProperties
wsProperty
wsRawOutputParameter
wsRawOutputParameters
wsResultType
wsResultTypes
(wsParameters){
id[] =
"alignment",
"type",
"ktup",
"window",
"score",
"topdiags",
"pairgap",
"pwmatrix",
"pwdnamatrix",
"pwgapopen",
"pwgapext",
"matrix",
"dnamatrix",
"gapopen",
"noendgaps",
"gapext",
"gapdist",
"iteration",
"numiter",
"clustering",
"output",
"outorder",
"sequence",
}
(wsParameterDetails){
name = "Output order"
description = "The order in which the sequences appear in the final alignment"
type = "STRING"
values =
(wsParameterValues){
value[] =
(wsParameterValue){
label = "aligned"
value = "aligned"
defaultValue = True
},
(wsParameterValue){
label = "input"
value = "input"
defaultValue = False
},
}
}
Wait till the calculation finishes.
import base64
import time
status = 'PENDING'
while status == 'RUNNING' or status == 'PENDING':
status = client.service.getStatus(jobid)
print status
if status == 'RUNNING' or status == 'PENDING':
time.sleep(15)
print client.service.getResultTypes(jobid)
result = client.service.getResult(jobid, "aln-clustalw")
dec = base64.b64decode( result )
print dec
FINISHED
(wsResultTypes){
type[] =
(wsResultType){
description = "The output from the tool itself"
fileSuffix = "txt"
identifier = "out"
label = "Tool Output"
mediaType = "text/plain"
},
(wsResultType){
description = "Your input sequences as seen by the tool"
fileSuffix = "txt"
identifier = "sequence"
label = "Input Sequences"
mediaType = "text/plain"
},
(wsResultType){
description = "The alignment in CLUSTAL format"
fileSuffix = "clustalw"
identifier = "aln-clustalw"
label = "Alignment in CLUSTAL format"
mediaType = "text/x-clustalw-alignment"
},
(wsResultType){
description = "The alignment guide tree"
fileSuffix = "dnd"
identifier = "tree"
label = "Guide Tree"
mediaType = "text/x-clustalw-tree"
},
}
CLUSTAL 2.1 multiple sequence alignment
sp|Q45066|PARC_BACSU -----------MSQPELFHDLPLEEVIGDRFGRYSKYIIQDRALPDARDGLKPVQRRILY
sp|O51066|PARC_BORBU --------------------MDIRTVLKDNFLQYSSYVIKDRAIASVVDGFKPVQRRIIH
sp|O54478|PARC_CAUCR MNKPVLPPPGGPDDGDRILDEPLTEALSRRYLAYALSTIGSRALPDVRDGLKPVHRRVLY
sp|P0AFI3|PARC_ECO57 ----------MSDMAERLALHEFTEN---AYLNYSMYVIMDRALPFIGDGLKPVQRRIVY
sp|P0AFI2|PARC_ECOLI ----------MSDMAERLALHEFTEN---AYLNYSMYVIMDRALPFIGDGLKPVQRRIVY
: : *: * .**:. **:***:**:::
sp|Q45066|PARC_BACSU AMHTDGNTFDKNFRKAAKTVGNVIGNYHPHGDSSVYEAMVRMSQDWKVRNVLIEMHGNNG
sp|O51066|PARC_BORBU SLFEM---HDGNFHKVANVVGNTM-KYHPHGDTSIYEALVNIAN----KDLFIEKQGNFG
sp|O54478|PARC_CAUCR AMSNMRLNPDAAARKCAKVVGEVMGNFHPHGDASIYDALVRLAQEFSQRIPLVEGQGNFG
sp|P0AFI3|PARC_ECO57 AMSELGLNASAKFKKSARTVGDVLGKYHPHGDSACYEAMVLMAQPFSYRYPLVDGQGNWG
sp|P0AFI2|PARC_ECOLI AMSELGLNASAKFKKSARTVGDVLGKYHPHGDSACYEAMVLMAQPFSYRYPLVDGQGNWG
:: . :* *..**:.: ::*****:: *:*:* ::: : ::: :** *
sp|Q45066|PARC_BACSU SIDG-DPPAAMRYTEARLSPIASELLRDIDKNTVEFVPNFDDTSKEPVVLPAMFPNLLVN
sp|O51066|PARC_BORBU NLFTGDPASASRYIECRLTPLAFDVL--YSKEITIYESSYDGRNNEPLLYPAKIPVILIQ
sp|O54478|PARC_CAUCR NIDG-DSAAAMRYTECKMTEAAMLLLDGIDEDAVDFRPTYDGQDEEPVVLPSGFPNLLAN
sp|P0AFI3|PARC_ECO57 APDDPKSFAAMRYTESRLSKYSELLLSELGQGTADWVPNFDGTLQEPKMLPARLPNILLN
sp|P0AFI2|PARC_ECOLI APDDPKSFAAMRYTESRLSKYSELLLSELGQGTADWVPNFDGTLQEPKMLPARLPNILLN
.. :* ** *.::: : :* .: . : ..:*. :** : *: :* :* :
sp|Q45066|PARC_BACSU GSTGISAGYATDIPPHHLGEVIDAVIKRIQMPSCSVDELMEVIKGPDFPTGG-IIQGVDG
sp|O51066|PARC_BORBU GSEGIAVGMAAKILPHNFNEILNAVKSELLGESYDIY--------PDFPTGGIVDVN---
sp|O54478|PARC_CAUCR GSSGIAVGMATSIPPHNAAELIDACQLLLANPDATTADLLEKVPGPDFPTGGVIVESRAS
sp|P0AFI3|PARC_ECO57 GTTGIAVGMATDIPPHNLREVAQAAIALIDQPKTTLDQLLDIVQGPDYPTEAEIITSRAE
sp|P0AFI2|PARC_ECOLI GTTGIAVGMATDIPPHNLREVAQAAIALIDQPKTTLDQLLDIVQGPDYPTEAEIITSRAE
*: **:.* *:.* **: *: :* : . **:** . : .
sp|Q45066|PARC_BACSU IRKAYETGKGKIIIRGKAEIETIRGGREQIVITEIPFEVNKANLVKKMDEFRIDKKVEGI
sp|O51066|PARC_BORBU ---EYADGNGKVLVRAKIETIDEK----TIVIRELPFGETTESLISSIEKAIR-KNYIKV
sp|O54478|PARC_CAUCR LLETYETGRGGVRMRAKWEKEDTGRGTYQIVVTEIPYQVKKSDLVEQLADLIDSKKAALL
sp|P0AFI3|PARC_ECO57 IRKIYENGRGSVRMRAVWKKEDG-----AVVISALPHQVSGARVLEQIAAQMRNKKLPMV
sp|P0AFI2|PARC_ECOLI IRKIYENGRGSVRMRAVWKKEDG-----AVVISALPHQVSGARVLEQIAAQMRNKKLPMV
* *.* : :*. : :*: :*. . ::..: *: :
sp|Q45066|PARC_BACSU SEVRDETDRTG--LRVVIELKKEADAKGILNFLYKNTDLQITYNFNMVAIHNR-RPMLMS
sp|O51066|PARC_BORBU SSINDFTAEN---VAIELSLPRGVYASEVIEKLYHYTNCQISISVNLLLLSER-YPVVYT
sp|O54478|PARC_CAUCR GDVRDESAED--IRLVLEPKSKNVEPEVLMESLFKLSALESRFPVNINVLDARGTPGVMG
sp|P0AFI3|PARC_ECO57 DDLRDESDHENPTRLVIVPRSNRVDMDQVMNHLFATTDLEKSYRINLNMIGLDGRPAVKN
sp|P0AFI2|PARC_ECOLI DDLRDESDHENPTRLVIVPRSNRVDMDQVMNHLFATTDLEKSYRINLNMIGLDGRPAVKN
..:.* : . : . . . ::: *: : : .*: : * :
sp|Q45066|PARC_BACSU LPSILDAYIGHQKEVVTNRSVYELQKAKDRHHIVEGLMKALSILDEVIATIRSSSDKRDA
sp|O51066|PARC_BORBU IKDLIKFHAAHLQKILKMELELQKSKILEKIFYKT--LEQIFIEKKIYKLLETISK----
sp|O54478|PARC_CAUCR IKQALMAFLAHRREVLTRRARHRLAKIEARLHILDGLLIAYLNLDEVIRIVRYEDK---P
sp|P0AFI3|PARC_ECO57 LLEILSEWLVFRRDTVRRRLNYRLEKVLKRLHILEGLLVAFLNIDEVIEIIRNEDE---P
sp|P0AFI2|PARC_ECOLI LLEILSEWLVFRRDTVRRRLNYRLEKVLKRLHILEGLLVAFLNIDEVIEIIRNEDE---P
: . : . :. : . . * : . : .:: :. ..
sp|Q45066|PARC_BACSU KNNLIAKYEFTEPQAEAIVSLQLYRLTNTDITALKEEAEELGKKIEELESILSNDKKLLK
sp|O51066|PARC_BORBU ----------EENILSIILSEVLRHKESFSREVLKEDVENLLKIPIRKISLFDIDKNSKD
sp|O54478|PARC_CAUCR KEKLIETFGLTDIQADAILNTRLRQLAKLEEMEIRREHAELVEERDGILAMLASEAKQWK
sp|P0AFI3|PARC_ECO57 KPALMSRFGLTETQAEAILELKLRHLAKLEEMKIRGEQSELEKERDQLQGILASERKMNN
sp|P0AFI2|PARC_ECOLI KPALMSRFGLTETQAEAILELKLRHLAKLEEMKIRGEQSELEKERDQLQGILASERKMNN
: . *:. * : . . :: : :* : .:: : : .
sp|Q45066|PARC_BACSU VIT-------------NSLKALKKKYADTRRSVIEEK-IEEIKINLEVMVASEDVYVTVT
sp|O51066|PARC_BORBU IKILN-----------KELKSINSNISSIRGYSINFIDLLLAKYSKEHQRKTK---ISLI
sp|O54478|PARC_CAUCR LVGVGLSEVRAALLKIKHPLDKPRPTGVTGRSVFGEAPQVDADAAIEAMIVREPITIILS
sp|P0AFI3|PARC_ECO57 LL-------------KKELQADAQAYGDDRRSPLQEREEAKAMSEHD-MLPSEPVTIVLS
sp|P0AFI2|PARC_ECOLI LL-------------KKELQADAQAYGDDRRSPLQEREEAKAMSEHD-MLPSEPVTIVLS
: : . : : : : :
sp|Q45066|PARC_BACSU KDGYLKRTSQRSFAASNGQDFGMKDTDRMLHQFEMNTTDVLLLFTNKGSYIYCPVHQLPD
sp|O51066|PARC_BORBU KSKNVKEIATKNMKVYLNLAEGFAGTSLFDGEFIGNASYYDKILVFRENSYVLKNIEDKT
sp|O54478|PARC_CAUCR ERGWIRAAKGKIDDPS---ELKFKEGDKLGFLVPAETTDKLLIFSSDGRFFTLGCDKLPS
sp|P0AFI3|PARC_ECO57 QMGWVRSAKGHDIDAP---GLNYKAGDSFKAAVKGKSNQPVVFVDSTGRSYAIDPITLPS
sp|P0AFI2|PARC_ECOLI QMGWVRSAKGHDIDAP---GLNYKAGDSFKAAVKGKSNQPVVFVDSTGRSYAIDPITLPS
: :: : . : . ::. :.
sp|Q45066|PARC_BACSU IRWKDMGQHFSNLITIDRDETIVKAIPIKEFDPSAYLLFFTKNGMVKKTELTHYKAQRYS
sp|O51066|PARC_BORBU FIDKKN-----------------------------------------------------V
sp|O54478|PARC_CAUCR ARGHGE------------------------------------------------------
sp|P0AFI3|PARC_ECO57 ARGQGE------------------------------------------------------
sp|P0AFI2|PARC_ECOLI ARGQGE------------------------------------------------------
:
sp|Q45066|PARC_BACSU KALVALNLKGEDELIDVHVTNGESQIFMATHLGYGLWFGEDEVNVVGARAAGVKGINLKE
sp|O51066|PARC_BORBU CALVYDINNSKEQIFSIIYFNRLDNFYYVKRFKIDKFITDKVYEFLGENDE---------
sp|O54478|PARC_CAUCR PVRMMIELDDKVKIIDVFPFKAGRKRILASKGGYGFLMPEE--EALANRKAGKQVLNVGN
sp|P0AFI3|PARC_ECO57 PLTGKLTLPPGATVDHMLMESDDQKLLMASDAGYGFVCTFN--DLVARNRAGKALITLPE
sp|P0AFI2|PARC_ECOLI PLTGKLTLPPGATVDHMLMESDDQKLLMASDAGYGFVCTFN--DLVARNRAGKALITLPE
: : . : .. . . : :. .
sp|Q45066|PARC_BACSU DDFVVSGEILQQS-DSIVLFTQRGAVKRMSLSEFEKTSRAKRGVVMLR---------ELK
sp|O51066|PARC_BORBU ------FVDFSLNPEFVEFSTNKDIVKRIEIDNFMVKSRSSIG-----------------
sp|O54478|PARC_CAUCR EG---AAFCLEAVGDQLAVIGDNGKILIFPLEELPEMPRGKG----VKLQ-------AYR
sp|P0AFI3|PARC_ECO57 NAHVMPPVVIEDASDMLLAITQAGRMLMFPVSDLPQLSKGKG----NKIINIPSAEAARG
sp|P0AFI2|PARC_ECOLI NAHVMPPVVIEDASDMLLAITQAGRMLMFPVSDLPQLSKGKG----NKIINIPSAEAARG
:. : : : . : : :.:: .:..
sp|Q45066|PARC_BACSU KNPHRVVALFACGLEQRLMAETEKGDRKELQTKELRTNDRYSNGSFFFDEEESGKVTAVW
sp|O51066|PARC_BORBU ---------------------------------KRISSNNLKKVKFK-------------
sp|O54478|PARC_CAUCR EGGLRDGLSFNAETGAYWIDTAGR--RRDWAEWKEWVGRRAGAGKLVP-------KGFAT
sp|P0AFI3|PARC_ECO57 EDGLAQLYVLPPQS-TLTIHVGKRKIKLRPEELQKVTGERGRRGTLMRGLQRIDRVEIDS
sp|P0AFI2|PARC_ECOLI EDGLAQLYVLPPQS-TLTIHVGKRKIKLRPEELQKVTGERGRRGTLMRGLQRIDRVEIDS
: . . .:
sp|Q45066|PARC_BACSU RLHTEQ-----
sp|O51066|PARC_BORBU -----------
sp|O54478|PARC_CAUCR NKRFRPK----
sp|P0AFI3|PARC_ECO57 PRRASSGDSEE
sp|P0AFI2|PARC_ECOLI PRRASSGDSEE
This is great! I don’t have to be bothered with the piles of fasta files anymore. All I need is this script.
Blasting EBI Services 1
Although I knew there were some programtical way to access EMBL-EBI databases(http://www.ebi.ac.uk/Tools/webservices/), I had no idea how to do it because I didn’t know about UDDI, WSDL, and SOAP.
After learning the concepts of those technologies I tried some examples in this page (http://www.ebi.ac.uk/Tools/webservices/tutorials/06_programming/python/soap/soappy). The first two example code worked without any modifications. But the third one, “Complex Data Structures”, didn’t work.
The cause of this problem was that the url(http://www.ebi.ac.uk/Tools/webservices/wsdl/WSWUBlast.wsdl) used in the code was not accessible anymore. The right url was “http://www.ebi.ac.uk/Tools/services/soap/fasta?wsdl”. Furthermore, the way to access the service had changed accordingly.
After some struggles, I modified the code as follows.
# Import WSDL package
from SOAPpy import WSDL
# Create service interface
wsdlUrl = 'http://www.ebi.ac.uk/Tools/services/soap/fasta?wsdl'
server = WSDL.Proxy(wsdlUrl)
# For debug
server.soapproxy.config.dumpSOAPOut = 1
server.soapproxy.config.dumpSOAPIn = 1
# Query sequence
seq = """>Q8E5Q5_STRA3
MKLSKRYRFWQKVIKALGVLALIATLVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFI
VAQIVQIVFPVIPGGVTTVAGFLIFGPTLGFIYNYIGIIIGSVILFWLVKFYGRKFVLLF
MDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPI
SIIGYSYLWIYGGDILKNFLN"""
dbnames = {'string':'pdb'}
# Structure containing parameters
blast_params = {
'program':'fasta', # Protein vs. protein search
'database':dbnames, # Database to search
'scores':10,
'alignments':10,
'stype':'protein',
'sequence':seq
}
server.methods['run'].namespace = server.wsdl.targetNamespace
jobid = server.run(email="pachiras@localhost", title="testjob", parameters=blast_params)
The point was to specify the namespace before calling each method. Since server.wsdl.targetNamespace holds the namespace written in the wsdl file( ‘http://www.ebi.ac.uk/Tools/services/soap/fasta?wsdl’), its value is substituted into server.method[‘some_method’].namespace.
With suds library, this code is rewritten in a more simple way.
from suds.client import Client
wsdlUrl = 'http://www.ebi.ac.uk/Tools/services/soap/fasta?wsdl'
client = Client(wsdlUrl)
seq = """MKLSKRYRFWQKVIKALGVLALIATLVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFI
VAQIVQIVFPVIPGGVTTVAGFLIFGPTLGFIYNYIGIIIGSVILFWLVKFYGRKFVLLF
MDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPI
SIIGYSYLWIYGGDILKNFLN"""
dbnames = {'string':'pdb'}
# Structure containing parameters
blast_params = {
'program':'fasta', # Protein vs. protein search
'database':dbnames, # Database to search
'scores':10,
'alignments':10,
'stype':'protein',
'sequence':seq
}
jobid = client.service.run(email="pachiras@localhost", title="testjob", parameters=blast_params)
If you’d like to know what type of methods the service is providing, simply type this.
print client
Suds ( https://fedorahosted.org/suds/ ) version: 0.4.1 (beta) build: R703-20101015
Service ( JDispatcherService ) tns="http://soap.jdispatcher.ebi.ac.uk"
Prefixes (1)
ns0 = "http://soap.jdispatcher.ebi.ac.uk"
Ports (1):
(JDispatcherServiceHttpPort)
Methods (6):
getParameterDetails(xs:string parameterId, )
getParameters()
getResult(xs:string jobId, xs:string type, wsRawOutputParameters parameters, )
getResultTypes(xs:string jobId, )
getStatus(xs:string jobId, )
run(xs:string email, xs:string title, InputParameters parameters, )
Types (12):
ArrayOfString
InputParameters
wsParameterDetails
wsParameterValue
wsParameterValues
wsParameters
wsProperties
wsProperty
wsRawOutputParameter
wsRawOutputParameters
wsResultType
wsResultTypes
In those method getResultTypes() will give you the information about the format in which your result will be returned. So type this.
print client.service.getResultTypes(jobid)
(wsResultTypes){
type[] =
(wsResultType){
description = "The output from the tool itself"
fileSuffix = "txt"
identifier = "out"
label = "Tool Output"
mediaType = "text/plain"
},
(wsResultType){
description = "The M9 output from the tool itself"
fileSuffix = "txt"
identifier = "m9"
label = "Tool M9 Output"
mediaType = "text/plain"
},
(wsResultType){
description = "Your input sequence as seen by the tool"
fileSuffix = "txt"
identifier = "sequence"
label = "Input Sequence"
mediaType = "text/plain"
},
(wsResultType){
description = "The list of the alignment identifiers in the order they appear in the results"
fileSuffix = "txt"
identifier = "ids"
label = "Alignment Identifiers"
mediaType = "text/plain"
},
(wsResultType){
description = "The results of the job in XML (EBI format)"
fileSuffix = "xml"
identifier = "xml"
label = "XML Result"
mediaType = "application/xml"
},
(wsResultType){
description = "A visual representation of the results in SVG format"
fileSuffix = "svg"
identifier = "visual-svg"
label = "Visual Result (SVG)"
mediaType = "image/svg+xml"
},
(wsResultType){
description = "A visual representation of the results in PNG format"
fileSuffix = "png"
identifier = "visual-png"
label = "Visual Result (PNG)"
mediaType = "image/png"
},
(wsResultType){
description = "A visual representation of the results in JPEG format"
fileSuffix = "jpg"
identifier = "visual-jpg"
label = "Visual Result (JPEG)"
mediaType = "image/jpeg"
},
}
It takes some time until you get the result. So you have to wail and poll as in http://www.ebi.ac.uk/Tools/webservices/tutorials/06_programming/python/soap/soappy
import base64
status = 'PENDING'
while status == 'RUNNING' or status == 'PENDING':
status = client.service.getStatus(jobid)
print status
if status == 'RUNNING' or status == 'PENDING':
time.sleep(15)
result = client.service.getResult(jobid, "xml")
dec = base64.b64decode( result )
print dec
FINISHED
<?xml version='1.0' encoding='ISO-8859-1'?>
<EBIApplicationResult xmlns="http://www.ebi.ac.uk/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.ebi.ac.uk/Tools/common/schema/ApplicationResult.xsd">
<Header>
<program name="FASTA" version="36.3.5e Nov, 2012(preload8)"/>
<commandLine command="cat fasta-S20130413-060528-0619-72925383-pg.sequence | /nfs/public/ro/es/appbin/linux-x86_64/fasta-36.3.5e-20121123/fasta36 -l $IDATA_CURRENT/fastacfg/fasta3db -L -T 8 -p -b 10 -d 10 -m "F9 fasta-S20130413-060528-0619-72925383-pg.m9" \@:1- +pdb+"/>
<parameters>
<sequences total="1">
<sequence number="1" name="EMBOSS_001" type="PROTEIN" length="201"/>
</sequences>
<databases total="1" sequences="-1" letters="-1">
<database number="1" name="pdb" type="NUCLEOTIDE"/>
</databases>
<alignments>10</alignments>
<scores>10</scores>
</parameters>
<timeInfo start="2013-04-13T06:05:28+01:00" end="2013-04-13T06:05:32+01:00" search="P0Y0M0DT0H0M4.000S"/>
</Header>
<SequenceSimilaritySearchResult>
<hits total="10">
<hit number="1" database="PDB" id="4HYC_F" length="301" description="mol:protein length:301 Putative uncharacterized protein">
<alignments total="1">
<alignment number="1">
<score>92</score>
<smithWatermanScore>92</smithWatermanScore>
<initn>43</initn>
<init1>43</init1>
<opt>92</opt>
<zScore>128.8</zScore>
<bits>31.3</bits>
<expectation>5.0</expectation>
<identity>27.0</identity>
<positives>66.2</positives>
<overlap>74</overlap>
<strand>none/plus</strand>
<querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
<pattern> :.... .::. ::. :.: .::.. : .:. .. .::.... ...:. :.. ::.:. :. : </pattern>
<matchSeq start="13" end="82"> MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
</alignment>
</alignments>
</hit>
<hit number="2" database="PDB" id="4HYC_E" length="301" description="mol:protein length:301 Putative uncharacterized protein">
<alignments total="1">
<alignment number="1">
<score>92</score>
<smithWatermanScore>92</smithWatermanScore>
<initn>43</initn>
<init1>43</init1>
<opt>92</opt>
<zScore>128.8</zScore>
<bits>31.3</bits>
<expectation>5.0</expectation>
<identity>27.0</identity>
<positives>66.2</positives>
<overlap>74</overlap>
<strand>none/plus</strand>
<querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
<pattern> :.... .::. ::. :.: .::.. : .:. .. .::.... ...:. :.. ::.:. :. : </pattern>
<matchSeq start="13" end="82"> MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
</alignment>
</alignments>
</hit>
<hit number="3" database="PDB" id="4HYD_B" length="301" description="mol:protein length:301 Putative uncharacterized protein">
<alignments total="1">
<alignment number="1">
<score>92</score>
<smithWatermanScore>92</smithWatermanScore>
<initn>43</initn>
<init1>43</init1>
<opt>92</opt>
<zScore>128.8</zScore>
<bits>31.3</bits>
<expectation>5.0</expectation>
<identity>27.0</identity>
<positives>66.2</positives>
<overlap>74</overlap>
<strand>none/plus</strand>
<querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
<pattern> :.... .::. ::. :.: .::.. : .:. .. .::.... ...:. :.. ::.:. :. : </pattern>
<matchSeq start="13" end="82"> MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
</alignment>
</alignments>
</hit>
<hit number="4" database="PDB" id="4HYG_C" length="301" description="mol:protein length:301 Putative uncharacterized protein">
<alignments total="1">
<alignment number="1">
<score>92</score>
<smithWatermanScore>92</smithWatermanScore>
<initn>43</initn>
<init1>43</init1>
<opt>92</opt>
<zScore>128.8</zScore>
<bits>31.3</bits>
<expectation>5.0</expectation>
<identity>27.0</identity>
<positives>66.2</positives>
<overlap>74</overlap>
<strand>none/plus</strand>
<querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
<pattern> :.... .::. ::. :.: .::.. : .:. .. .::.... ...:. :.. ::.:. :. : </pattern>
<matchSeq start="13" end="82"> MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
</alignment>
</alignments>
</hit>
<hit number="5" database="PDB" id="4HYD_D" length="301" description="mol:protein length:301 Putative uncharacterized protein">
<alignments total="1">
<alignment number="1">
<score>92</score>
<smithWatermanScore>92</smithWatermanScore>
<initn>43</initn>
<init1>43</init1>
<opt>92</opt>
<zScore>128.8</zScore>
<bits>31.3</bits>
<expectation>5.0</expectation>
<identity>27.0</identity>
<positives>66.2</positives>
<overlap>74</overlap>
<strand>none/plus</strand>
<querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
<pattern> :.... .::. ::. :.: .::.. : .:. .. .::.... ...:. :.. ::.:. :. : </pattern>
<matchSeq start="13" end="82"> MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
</alignment>
</alignments>
</hit>
<hit number="6" database="PDB" id="4HYD_C" length="301" description="mol:protein length:301 Putative uncharacterized protein">
<alignments total="1">
<alignment number="1">
<score>92</score>
<smithWatermanScore>92</smithWatermanScore>
<initn>43</initn>
<init1>43</init1>
<opt>92</opt>
<zScore>128.8</zScore>
<bits>31.3</bits>
<expectation>5.0</expectation>
<identity>27.0</identity>
<positives>66.2</positives>
<overlap>74</overlap>
<strand>none/plus</strand>
<querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
<pattern> :.... .::. ::. :.: .::.. : .:. .. .::.... ...:. :.. ::.:. :. : </pattern>
<matchSeq start="13" end="82"> MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
</alignment>
</alignments>
</hit>
<hit number="7" database="PDB" id="4HYC_G" length="301" description="mol:protein length:301 Putative uncharacterized protein">
<alignments total="1">
<alignment number="1">
<score>92</score>
<smithWatermanScore>92</smithWatermanScore>
<initn>43</initn>
<init1>43</init1>
<opt>92</opt>
<zScore>128.8</zScore>
<bits>31.3</bits>
<expectation>5.0</expectation>
<identity>27.0</identity>
<positives>66.2</positives>
<overlap>74</overlap>
<strand>none/plus</strand>
<querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
<pattern> :.... .::. ::. :.: .::.. : .:. .. .::.... ...:. :.. ::.:. :. : </pattern>
<matchSeq start="13" end="82"> MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
</alignment>
</alignments>
</hit>
<hit number="8" database="PDB" id="4HYD_A" length="301" description="mol:protein length:301 Putative uncharacterized protein">
<alignments total="1">
<alignment number="1">
<score>92</score>
<smithWatermanScore>92</smithWatermanScore>
<initn>43</initn>
<init1>43</init1>
<opt>92</opt>
<zScore>128.8</zScore>
<bits>31.3</bits>
<expectation>5.0</expectation>
<identity>27.0</identity>
<positives>66.2</positives>
<overlap>74</overlap>
<strand>none/plus</strand>
<querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
<pattern> :.... .::. ::. :.: .::.. : .:. .. .::.... ...:. :.. ::.:. :. : </pattern>
<matchSeq start="13" end="82"> MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
</alignment>
</alignments>
</hit>
<hit number="9" database="PDB" id="4HYC_B" length="301" description="mol:protein length:301 Putative uncharacterized protein">
<alignments total="1">
<alignment number="1">
<score>92</score>
<smithWatermanScore>92</smithWatermanScore>
<initn>43</initn>
<init1>43</init1>
<opt>92</opt>
<zScore>128.8</zScore>
<bits>31.3</bits>
<expectation>5.0</expectation>
<identity>27.0</identity>
<positives>66.2</positives>
<overlap>74</overlap>
<strand>none/plus</strand>
<querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
<pattern> :.... .::. ::. :.: .::.. : .:. .. .::.... ...:. :.. ::.:. :. : </pattern>
<matchSeq start="13" end="82"> MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
</alignment>
</alignments>
</hit>
<hit number="10" database="PDB" id="4HYC_D" length="301" description="mol:protein length:301 Putative uncharacterized protein">
<alignments total="1">
<alignment number="1">
<score>92</score>
<smithWatermanScore>92</smithWatermanScore>
<initn>43</initn>
<init1>43</init1>
<opt>92</opt>
<zScore>128.8</zScore>
<bits>31.3</bits>
<expectation>5.0</expectation>
<identity>27.0</identity>
<positives>66.2</positives>
<overlap>74</overlap>
<strand>none/plus</strand>
<querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
<pattern> :.... .::. ::. :.: .::.. : .:. .. .::.... ...:. :.. ::.:. :. : </pattern>
<matchSeq start="13" end="82"> MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
</alignment>
</alignments>
</hit>
</hits>
</SequenceSimilaritySearchResult></EBIApplicationResult>
I chose “xml” format. The result must be decoded with base64.
Accessing Uniprot
With urllib library you can access Uniprot(http://www.uniprot.org/) in very straightforward way.
import urllib
args = {
'query':'gene:ParC active:yes reviewed:yes',
'format':'xml',
'offset':0,
'limit':15
}
encoded_args = urllib.urlencode( args )
url = "http://www.uniprot.org/uniprot/"
xml_data = urllib.urlopen( url, data = encoded_args )
When I first saw the encoded string, I thought something was wrong because it included some weird characters.
print encoded_args
query=gene%3AParC+active%3Ayes+reviewed%3Ayes&offset=0&limit=15&format=xml
But I found that was OK. The special characters were expressed in unicode, like, ‘%3A’ = ‘:’. Those encoded strings are transformed into human-readable form with urllib.unquote().
print urllib.unquote(encoded_args)
query=gene:ParC+active:yes+reviewed:yes&offset=0&limit=15&format=xml
The meanings of arguments are written in this page(http://www.uniprot.org/faq/28).
Once you get the xml data you can extract any information from that as you like.
from xml.dom import minidom
dom = minidom.parse( xml_data )
entries = dom.getElementsByTagName("entry")
for entry in entries:
# entry name
ent_name = entry.getElementsByTagName("name")
print ent_name[0].childNodes[0].data
# Organism
orgs = entry.getElementsByTagName("organism")
org_names = orgs[0].getElementsByTagName("name")
sciname = commname = ""
for org_name in org_names:
t = org_name.getAttribute("type")
if "scientific" == t:
sciname = org_name.childNodes[0].data
elif "common" == t:
commname = org_name.childNodes[0].data
if len(commname)>0:
sciname += ' (' + commname + ')'
print "- ",sciname
# PDB data
pdbids = []
dbrefs = entry.getElementsByTagName("dbReference")
for dbref in dbrefs:
t = dbref.getAttribute("type")
if "PDB" == t:
pdbids.append( dbref.getAttribute("id") )
if len(pdbids)>0:
print "- PDBID: ",",".join(pdbids)
CCL18_HUMAN
– Homo sapiens (Human)
CDP1_ARATH
– Arabidopsis thaliana (Mouse-ear cress)
CUL9_HUMAN
– Homo sapiens (Human)
– PDBID: 2JUF
CUL9_MOUSE
– Mus musculus (Mouse)
DOIS_STRRY
– Streptomyces rimosus subsp. paromomycinus
GSTXC_TOBAC
– Nicotiana tabacum (Common tobacco)
ORC1_HUMAN
– Homo sapiens (Human)
PARC4_ECOLX
– Escherichia coli
PARC_BACSU
– Bacillus subtilis (strain 168)
PARC_BORBU
– Borrelia burgdorferi (strain ATCC 35210 / B31 / CIP 102532 / DSM 4680)
PARC_CAUCR
– Caulobacter crescentus (strain ATCC 19089 / CB15)
PARC_CROSK
– Cronobacter sakazakii
PARC_ECO57
– Escherichia coli O157:H7
PARC_ECOLI
– Escherichia coli (strain K12)
– PDBID: 1ZVT,1ZVU
PARC_HAEIN
– Haemophilus influenzae (strain ATCC 51907 / DSM 11121 / KW20 / Rd)
Posting queries to RCSB PDB
The RCSB PDB is providing RESTful Web Service(http://www.rcsb.org/pdb/software/rest.do). You can execute advanced search by sending XML data. Some XML examples are shown by selecting query types from the pull-down at the bottom of the page. The page also shows an example code written by python (http://www.rcsb.org/pdb/software/static.do?p=/software/webservices/search_nmr.jsp). Although the code is written with urllib2, I rewrote the code with httplib2 so that I could keep the consistency with the previous code (And another reason was that urllib2 was felt difficult for me to handle the URL redirection ). If you use httplib2, you have to specify the http header strictly.
import httplib2
def get_pdbidlist_bykeyword( keyword ):
url = "http://www.rcsb.org/pdb/rest/search"
# XML block copy and pasted from RCSB PDB
# No indentation is allowed
queryText = """
<?xml version="1.0" encoding="UTF-8"?>
<orgPdbQuery>
<queryType>org.pdb.query.simple.AdvancedKeywordQuery</queryType>
<description>Text Search</description>
<keywords>%s</keywords>
</orgPdbQuery>
""" % keyword
q_header = {'Content-type': 'application/x-www-form-urlencoded'}
conn = httplib2.Http()
(response, body) = conn.request( url, "POST", body = queryText, headers=q_header)
pdbid_list = []
http_status = response["status"]
if( http_status == "200"):
pdbid_list = body.strip().split('\n')
else:
print "HTTP Response: ", http_status
return pdbid_list
Here is an usage example of this function.
pdbid_list = get_pdbidlist_bykeyword( "gyrA" )
for pid in pdbid_list:
print '>', pid
> 1AB4
> 1AM2
> 1SUU
> 1X75
> 1ZI0
> 1ZVU
> 2BH8
> 2XCO
> 2XCQ
> 2XCR
> 2XCS
> 2XCT
> 2Y3P
> 2ZJT
> 3G7Z
> 3IFZ
> 3ILW
> 3L6V
> 3LPX
> 3NO0
> 3NUH
> 3TTZ
> 3UC1
> 4DDQ
> 4ELY
> 4ELZ
With download_pdbfile() function written in here(https://pachiras.wordpress.com/2013/03/24/downloading-pdb-files/) you can download all the pdb files related to the keyword you specify.
Downloading PDB files
Downloading PDB files is one of my daily tasks. So I wrote this to make it easier.
import httplib2, StringIO, gzip
def download_pdbfile( pdbid ):
if(len(pdbid) != 4 ):
print "Invalid id: ", pdbid
return
url = 'http://www.rcsb.org/pdb/files/%s.pdb.gz' % pdbid
conn = httplib2.Http()
(response, body) = conn.request( url )
enc_body = StringIO.StringIO( body )
content = gzip.GzipFile( fileobj= enc_body ).read()
local_file = open(pdbid+'.pdb','wb')
local_file.write( content )
local_file.close()
