Python

Listing AA sequence entries for a specific strain

Posted on Updated on

If you study bacteria, you should sometimes need amino acid sequences for your strain including mutants. Following codes are what I wrote for that purpose.

Let’s say we want to check UniProt for S. aureus ParC.
We, firstly, need the taxonomy ID for S. aureus. Since UniProt uses the same taxonomy classes as in NCBI, here we access the site with SOAP.

from suds.client import Client

wsdl_file = 'http://www.ncbi.nlm.nih.gov/soap/v2.0/eutils.wsdl'
client = Client( wsdl_file )

print client

Suds ( https://fedorahosted.org/suds/ )  version: 0.4.1 (beta)  build: R703-20101015

Service ( eUtilsService ) tns="http://www.ncbi.nlm.nih.gov/soap/eutils/"
   Prefixes (6)
      ns0 = "http://www.ncbi.nlm.nih.gov/soap/eutils/egquery"
      ns1 = "http://www.ncbi.nlm.nih.gov/soap/eutils/einfo"
      ns2 = "http://www.ncbi.nlm.nih.gov/soap/eutils/elink"
      ns3 = "http://www.ncbi.nlm.nih.gov/soap/eutils/epost"
      ns4 = "http://www.ncbi.nlm.nih.gov/soap/eutils/esearch"
      ns5 = "http://www.ncbi.nlm.nih.gov/soap/eutils/esummary"
   Ports (1):
      (eUtilsServiceSoap)
         Methods (7):
            run_eGquery(xs:string term, xs:string tool, xs:string email, )
            run_eInfo(xs:string db, xs:string tool, xs:string email, )
            run_eLink(xs:string db, xs:string[] id, xs:string reldate, xs:string mindate, xs:string maxdate, xs:string datetype, xs:string term, xs:string dbfrom, xs:string linkname, xs:string WebEnv, xs:string query_key, xs:string cmd, xs:string tool, xs:string email, )
            run_ePost(xs:string db, xs:string id, xs:string WebEnv, xs:string tool, xs:string email, )
            run_eSearch(xs:string db, xs:string term, xs:string WebEnv, xs:string QueryKey, xs:string usehistory, xs:string tool, xs:string email, xs:string field, xs:string reldate, xs:string mindate, xs:string maxdate, xs:string datetype, xs:string RetStart, xs:string RetMax, xs:string rettype, xs:string sort, )
            run_eSpell(xs:string db, xs:string term, xs:string tool, xs:string email, )
            run_eSummary(xs:string db, xs:string id, xs:string WebEnv, xs:string query_key, xs:string retstart, xs:string retmax, xs:string tool, xs:string email, )
         Types (34):
            ns1:DbInfoType
            ns1:DbListType
            ns5:DocSumType
            ns4:ErrorListType
            ns1:FieldListType
            ns1:FieldType
            ns2:FirstCharsType
            ns2:IconUrlType
            ns2:IdCheckListType
            ns2:IdLinkSetType
            ns2:IdListType
            ns4:IdListType
            ns2:IdType
            ns2:IdUrlListType
            ns2:IdUrlSetType
            ns3:InvalidIdListType
            ns5:ItemType
            ns2:LinkInfoType
            ns1:LinkListType
            ns2:LinkSetDbHistoryType
            ns2:LinkSetDbType
            ns2:LinkSetType
            ns1:LinkType
            ns2:LinkType
            ns2:ObjUrlType
            ns2:ProviderType
            ns0:ResultItemType
            ns4:TermSetType
            ns4:TranslationSetType
            ns4:TranslationStackType
            ns4:TranslationType
            ns2:UrlType
            ns4:WarningListType
            ns0:eGQueryResultType

From the output above we are able to know that the service provides the method, run_eSearch(). The meanings of its arguments are listed in http://www.ncbi.nlm.nih.gov/books/NBK43082/.

Specifying the taxonomy DB from NCBI, you can easily get the ID of S.aureus.

reply = client.service.run_eSearch(db='taxonomy', term='Staphylococcus aureus[all names]')

print reply
if (reply.Count>0):
    taxoID = reply.IdList[0][0]
(reply){
   Count = "1"
   RetMax = "1"
   RetStart = "0"
   IdList = 
      (IdListType){
         Id[] = 
            "1280",
      }
   TranslationSet = ""
   TranslationStack = 
      (TranslationStackType){
         TermSet = 
            (TermSetType){
               Term = "Staphylococcus aureus[all names]"
               Field = "all names"
               Count = "1"
               Explode = "N"
            }
         OP = "GROUP"
      }
   QueryTranslation = "Staphylococcus aureus[all names]"
 }

Next, we post a query to UniProt as we did in here.

import urllib

args = {
        'query':'gene:ParC taxonomy:%s active:yes reviewed:yes' % taxoID,
        'format':'xml',
        'offset':0
        }
encoded_args = urllib.urlencode( args )

url = "http://www.uniprot.org/uniprot/"

xml_data = urllib.urlopen( url, data = encoded_args )

The returned XML will be analyzed accordingly.

from xml.dom import minidom

dom = minidom.parse( xml_data )
entries = dom.getElementsByTagName("entry")

for entry in entries:
    ent_name = entry.getElementsByTagName("name")
    print ent_name[0].childNodes[0].data
    organism = entry.getElementsByTagName("organism")
    sci_name = organism[0].getElementsByTagName("name")
    print sci_name[0].childNodes[0].data
PARC_STAA8
Staphylococcus aureus (strain NCTC 8325)
PARC_STAAC
Staphylococcus aureus (strain COL)
PARC_STAAM
Staphylococcus aureus (strain Mu50 / ATCC 700699)
PARC_STAAN
Staphylococcus aureus (strain N315)
PARC_STAAR
Staphylococcus aureus (strain MRSA252)
PARC_STAAS
Staphylococcus aureus (strain MSSA476)
PARC_STAAU
Staphylococcus aureus
PARC_STAAW
Staphylococcus aureus (strain MW2)

We’ve got eight entries.

Exporting eps files from sdfile

Posted on

I learned that pybel is able to export svg files. But when I tried to convert the svg files into eps files programatically, I found that it was not so easy. After looking around so many pages, I finally found the way.

import pybel
import rsvg, cairo

mols = list( pybel.readfile("sdf","approved.sdf") )
print "Compound Total:",len(mols)

molnames = []
for mol in mols[100:105]:
    base_name = mol.data['DRUGBANK_ID']
    mol.title = base_name
    
    # make svg handle
    svg_data = mol.write("svg")
    h_svg = rsvg.Handle(data=svg_data)
    
    # make cairo context
    srf = cairo.PSSurface(base_name + ".eps", h_svg.props.width, h_svg.props.height)
    srf.set_eps(True)
    ctx = cairo.Context( srf )

    # make the rsvg handle draw on the cairo context
    h_svg.render_cairo(ctx)
    
    #finish drawing
    srf.finish()

If you need png files, you can replace the rendering part with this:

srf.write_to_png(base_name + ".png")

Altanatively, you can create PostScript files with rdkit.

from rdkit.Chem import SDMolSupplier,AllChem, Draw

suppl = SDMolSupplier("approved.sdf")

for i in range(100,105):
    mol = suppl[i]
    t = AllChem.Compute2DCoords(mol)
    base_name = mol.GetProp('DRUGBANK_ID')
    Draw.MolToFile(mol, base_name + ".ps", imageType="ps")

This is much simpler. But you have to convert the ps-files with another tool like “ps2eps”, which is not what I meant.

Displaying chemical structures on IPython Notebook

Posted on

IPython is very useful for research because you can record everything what you were thinking at that time. And this is more important: it’s simply fun to play around with!

I found some people mentioned about how to display chemical structures on IPython Notebook:
http://baoilleach.blogspot.jp/2012_03_01_archive.html
http://code.google.com/p/rdkit/wiki/IPythonIntegration

But what I wanted to do was to display all the structures in a sdfile. And, if possible, I wanted to show them in vector format. So I wrote this:

from IPython.core.display import display_svg, SVG
import pybel

mols = list( pybel.readfile("sdf","approved.sdf") )
print "Approved Drugs TOTAL:",len(mols)

molimgs = []
for mol in mols[100:105]:
    mol.removeh()
    molimgs.append( SVG(mol.write("svg")) )

display_svg( *molimgs ) 

ChemstructsOnIPython

Thd sdfile was downloaded from DrugBank(http://www.drugbank.ca/).

Blasting EBI services 2

Posted on Updated on

Once you get to know how the services work in EBI, it’s easy to use them. Let’s retrieve some amino acid sequence data from the database.

from suds.client import Client

wsdfUrl = 'http://www.ebi.ac.uk/ws/services/WSDbfetch?wsdl'

clw2Url = 'http://www.ebi.ac.uk/Tools/services/soap/clustalw2?wsdl'
client = Client(wsdfUrl)

ids ="PARC_BACSU PARC_BORBU PARC_CAUCR PARC_ECO57 PARC_ECOLI"
seqs = client.service.fetchBatch("uniprotkb", ids, format="fasta", style="raw")

print seqs

 

>sp|Q45066|PARC_BACSU DNA topoisomerase 4 subunit A OS=Bacillus subtilis (strain 168) GN=parC PE=3 SV=2
MSQPELFHDLPLEEVIGDRFGRYSKYIIQDRALPDARDGLKPVQRRILYAMHTDGNTFDK
NFRKAAKTVGNVIGNYHPHGDSSVYEAMVRMSQDWKVRNVLIEMHGNNGSIDGDPPAAMR
YTEARLSPIASELLRDIDKNTVEFVPNFDDTSKEPVVLPAMFPNLLVNGSTGISAGYATD
IPPHHLGEVIDAVIKRIQMPSCSVDELMEVIKGPDFPTGGIIQGVDGIRKAYETGKGKII
IRGKAEIETIRGGREQIVITEIPFEVNKANLVKKMDEFRIDKKVEGISEVRDETDRTGLR
VVIELKKEADAKGILNFLYKNTDLQITYNFNMVAIHNRRPMLMSLPSILDAYIGHQKEVV
TNRSVYELQKAKDRHHIVEGLMKALSILDEVIATIRSSSDKRDAKNNLIAKYEFTEPQAE
AIVSLQLYRLTNTDITALKEEAEELGKKIEELESILSNDKKLLKVITNSLKALKKKYADT
RRSVIEEKIEEIKINLEVMVASEDVYVTVTKDGYLKRTSQRSFAASNGQDFGMKDTDRML
HQFEMNTTDVLLLFTNKGSYIYCPVHQLPDIRWKDMGQHFSNLITIDRDETIVKAIPIKE
FDPSAYLLFFTKNGMVKKTELTHYKAQRYSKALVALNLKGEDELIDVHVTNGESQIFMAT
HLGYGLWFGEDEVNVVGARAAGVKGINLKEDDFVVSGEILQQSDSIVLFTQRGAVKRMSL
SEFEKTSRAKRGVVMLRELKKNPHRVVALFACGLEQRLMAETEKGDRKELQTKELRTNDR
YSNGSFFFDEEESGKVTAVWRLHTEQ
>sp|O51066|PARC_BORBU DNA topoisomerase 4 subunit A OS=Borrelia burgdorferi (strain ATCC 35210 / B31 / CIP 102532 / DSM 4680) GN=parC PE=3 SV=1
MDIRTVLKDNFLQYSSYVIKDRAIASVVDGFKPVQRRIIHSLFEMHDGNFHKVANVVGNT
MKYHPHGDTSIYEALVNIANKDLFIEKQGNFGNLFTGDPASASRYIECRLTPLAFDVLYS
KEITIYESSYDGRNNEPLLYPAKIPVILIQGSEGIAVGMAAKILPHNFNEILNAVKSELL
GESYDIYPDFPTGGIVDVNEYADGNGKVLVRAKIETIDEKTIVIRELPFGETTESLISSI
EKAIRKNYIKVSSINDFTAENVAIELSLPRGVYASEVIEKLYHYTNCQISISVNLLLLSE
RYPVVYTIKDLIKFHAAHLQKILKMELELQKSKILEKIFYKTLEQIFIEKKIYKLLETIS
KEENILSIILSEVLRHKESFSREVLKEDVENLLKIPIRKISLFDIDKNSKDIKILNKELK
SINSNISSIRGYSINFIDLLLAKYSKEHQRKTKISLIKSKNVKEIATKNMKVYLNLAEGF
AGTSLFDGEFIGNASYYDKILVFRENSYVLKNIEDKTFIDKKNVCALVYDINNSKEQIFS
IIYFNRLDNFYYVKRFKIDKFITDKVYEFLGENDEFVDFSLNPEFVEFSTNKDIVKRIEI
DNFMVKSRSSIGKRISSNNLKKVKFK
>sp|O54478|PARC_CAUCR DNA topoisomerase 4 subunit A OS=Caulobacter crescentus (strain ATCC 19089 / CB15) GN=parC PE=3 SV=2
MNKPVLPPPGGPDDGDRILDEPLTEALSRRYLAYALSTIGSRALPDVRDGLKPVHRRVLY
AMSNMRLNPDAAARKCAKVVGEVMGNFHPHGDASIYDALVRLAQEFSQRIPLVEGQGNFG
NIDGDSAAAMRYTECKMTEAAMLLLDGIDEDAVDFRPTYDGQDEEPVVLPSGFPNLLANG
SSGIAVGMATSIPPHNAAELIDACQLLLANPDATTADLLEKVPGPDFPTGGVIVESRASL
LETYETGRGGVRMRAKWEKEDTGRGTYQIVVTEIPYQVKKSDLVEQLADLIDSKKAALLG
DVRDESAEDIRLVLEPKSKNVEPEVLMESLFKLSALESRFPVNINVLDARGTPGVMGIKQ
ALMAFLAHRREVLTRRARHRLAKIEARLHILDGLLIAYLNLDEVIRIVRYEDKPKEKLIE
TFGLTDIQADAILNTRLRQLAKLEEMEIRREHAELVEERDGILAMLASEAKQWKLVGVGL
SEVRAALLKIKHPLDKPRPTGVTGRSVFGEAPQVDADAAIEAMIVREPITIILSERGWIR
AAKGKIDDPSELKFKEGDKLGFLVPAETTDKLLIFSSDGRFFTLGCDKLPSARGHGEPVR
MMIELDDKVKIIDVFPFKAGRKRILASKGGYGFLMPEEEALANRKAGKQVLNVGNEGAAF
CLEAVGDQLAVIGDNGKILIFPLEELPEMPRGKGVKLQAYREGGLRDGLSFNAETGAYWI
DTAGRRRDWAEWKEWVGRRAGAGKLVPKGFATNKRFRPK
>sp|P0AFI3|PARC_ECO57 DNA topoisomerase 4 subunit A OS=Escherichia coli O157:H7 GN=parC PE=3 SV=1
MSDMAERLALHEFTENAYLNYSMYVIMDRALPFIGDGLKPVQRRIVYAMSELGLNASAKF
KKSARTVGDVLGKYHPHGDSACYEAMVLMAQPFSYRYPLVDGQGNWGAPDDPKSFAAMRY
TESRLSKYSELLLSELGQGTADWVPNFDGTLQEPKMLPARLPNILLNGTTGIAVGMATDI
PPHNLREVAQAAIALIDQPKTTLDQLLDIVQGPDYPTEAEIITSRAEIRKIYENGRGSVR
MRAVWKKEDGAVVISALPHQVSGARVLEQIAAQMRNKKLPMVDDLRDESDHENPTRLVIV
PRSNRVDMDQVMNHLFATTDLEKSYRINLNMIGLDGRPAVKNLLEILSEWLVFRRDTVRR
RLNYRLEKVLKRLHILEGLLVAFLNIDEVIEIIRNEDEPKPALMSRFGLTETQAEAILEL
KLRHLAKLEEMKIRGEQSELEKERDQLQGILASERKMNNLLKKELQADAQAYGDDRRSPL
QEREEAKAMSEHDMLPSEPVTIVLSQMGWVRSAKGHDIDAPGLNYKAGDSFKAAVKGKSN
QPVVFVDSTGRSYAIDPITLPSARGQGEPLTGKLTLPPGATVDHMLMESDDQKLLMASDA
GYGFVCTFNDLVARNRAGKALITLPENAHVMPPVVIEDASDMLLAITQAGRMLMFPVSDL
PQLSKGKGNKIINIPSAEAARGEDGLAQLYVLPPQSTLTIHVGKRKIKLRPEELQKVTGE
RGRRGTLMRGLQRIDRVEIDSPRRASSGDSEE
>sp|P0AFI2|PARC_ECOLI DNA topoisomerase 4 subunit A OS=Escherichia coli (strain K12) GN=parC PE=1 SV=1
MSDMAERLALHEFTENAYLNYSMYVIMDRALPFIGDGLKPVQRRIVYAMSELGLNASAKF
KKSARTVGDVLGKYHPHGDSACYEAMVLMAQPFSYRYPLVDGQGNWGAPDDPKSFAAMRY
TESRLSKYSELLLSELGQGTADWVPNFDGTLQEPKMLPARLPNILLNGTTGIAVGMATDI
PPHNLREVAQAAIALIDQPKTTLDQLLDIVQGPDYPTEAEIITSRAEIRKIYENGRGSVR
MRAVWKKEDGAVVISALPHQVSGARVLEQIAAQMRNKKLPMVDDLRDESDHENPTRLVIV
PRSNRVDMDQVMNHLFATTDLEKSYRINLNMIGLDGRPAVKNLLEILSEWLVFRRDTVRR
RLNYRLEKVLKRLHILEGLLVAFLNIDEVIEIIRNEDEPKPALMSRFGLTETQAEAILEL
KLRHLAKLEEMKIRGEQSELEKERDQLQGILASERKMNNLLKKELQADAQAYGDDRRSPL
QEREEAKAMSEHDMLPSEPVTIVLSQMGWVRSAKGHDIDAPGLNYKAGDSFKAAVKGKSN
QPVVFVDSTGRSYAIDPITLPSARGQGEPLTGKLTLPPGATVDHMLMESDDQKLLMASDA
GYGFVCTFNDLVARNRAGKALITLPENAHVMPPVVIEDASDMLLAITQAGRMLMFPVSDL
PQLSKGKGNKIINIPSAEAARGEDGLAQLYVLPPQSTLTIHVGKRKIKLRPEELQKVTGE
RGRRGTLMRGLQRIDRVEIDSPRRASSGDSEE

Align the retrieved sequences, then.

from suds.client import Client

clw2Url = 'http://www.ebi.ac.uk/Tools/services/soap/clustalw2?wsdl'
client = Client(clw2Url)

print client
print client.service.getParameters()
print client.service.getParameterDetails("outorder")

clw2_params = {
    "outorder":"input",
    "sequence":seqs
}

jobid = client.service.run(email="pachiras@localhost", title="testjob", parameters=clw2_params)

Suds ( https://fedorahosted.org/suds/ )  version: 0.4.1 (beta)  build: R703-20101015

Service ( JDispatcherService ) tns="http://soap.jdispatcher.ebi.ac.uk"
   Prefixes (1)
      ns0 = "http://soap.jdispatcher.ebi.ac.uk"
   Ports (1):
      (JDispatcherServiceHttpPort)
         Methods (6):
            getParameterDetails(xs:string parameterId, )
            getParameters()
            getResult(xs:string jobId, xs:string type, wsRawOutputParameters parameters, )
            getResultTypes(xs:string jobId, )
            getStatus(xs:string jobId, )
            run(xs:string email, xs:string title, InputParameters parameters, )
         Types (12):
            ArrayOfString
            InputParameters
            wsParameterDetails
            wsParameterValue
            wsParameterValues
            wsParameters
            wsProperties
            wsProperty
            wsRawOutputParameter
            wsRawOutputParameters
            wsResultType
            wsResultTypes

(wsParameters){
   id[] =
      "alignment",
      "type",
      "ktup",
      "window",
      "score",
      "topdiags",
      "pairgap",
      "pwmatrix",
      "pwdnamatrix",
      "pwgapopen",
      "pwgapext",
      "matrix",
      "dnamatrix",
      "gapopen",
      "noendgaps",
      "gapext",
      "gapdist",
      "iteration",
      "numiter",
      "clustering",
      "output",
      "outorder",
      "sequence",
 }
(wsParameterDetails){
   name = "Output order"
   description = "The order in which the sequences appear in the final alignment"
   type = "STRING"
   values =
      (wsParameterValues){
         value[] =
            (wsParameterValue){
               label = "aligned"
               value = "aligned"
               defaultValue = True
            },
            (wsParameterValue){
               label = "input"
               value = "input"
               defaultValue = False
            },
      }
 }

Wait till the calculation finishes.

import base64
import time

status = 'PENDING'
while status == 'RUNNING' or status == 'PENDING':
    status = client.service.getStatus(jobid)
    print status
    if status == 'RUNNING' or status == 'PENDING':
        time.sleep(15)

print client.service.getResultTypes(jobid)

result = client.service.getResult(jobid, "aln-clustalw")
dec = base64.b64decode( result )

print dec
FINISHED
(wsResultTypes){
   type[] =
      (wsResultType){
         description = "The output from the tool itself"
         fileSuffix = "txt"
         identifier = "out"
         label = "Tool Output"
         mediaType = "text/plain"
      },
      (wsResultType){
         description = "Your input sequences as seen by the tool"
         fileSuffix = "txt"
         identifier = "sequence"
         label = "Input Sequences"
         mediaType = "text/plain"
      },
      (wsResultType){
         description = "The alignment in CLUSTAL format"
         fileSuffix = "clustalw"
         identifier = "aln-clustalw"
         label = "Alignment in CLUSTAL format"
         mediaType = "text/x-clustalw-alignment"
      },
      (wsResultType){
         description = "The alignment guide tree"
         fileSuffix = "dnd"
         identifier = "tree"
         label = "Guide Tree"
         mediaType = "text/x-clustalw-tree"
      },
 }
CLUSTAL 2.1 multiple sequence alignment

sp|Q45066|PARC_BACSU      -----------MSQPELFHDLPLEEVIGDRFGRYSKYIIQDRALPDARDGLKPVQRRILY
sp|O51066|PARC_BORBU      --------------------MDIRTVLKDNFLQYSSYVIKDRAIASVVDGFKPVQRRIIH
sp|O54478|PARC_CAUCR      MNKPVLPPPGGPDDGDRILDEPLTEALSRRYLAYALSTIGSRALPDVRDGLKPVHRRVLY
sp|P0AFI3|PARC_ECO57      ----------MSDMAERLALHEFTEN---AYLNYSMYVIMDRALPFIGDGLKPVQRRIVY
sp|P0AFI2|PARC_ECOLI      ----------MSDMAERLALHEFTEN---AYLNYSMYVIMDRALPFIGDGLKPVQRRIVY
                                                :       :  *:   * .**:.   **:***:**:::

sp|Q45066|PARC_BACSU      AMHTDGNTFDKNFRKAAKTVGNVIGNYHPHGDSSVYEAMVRMSQDWKVRNVLIEMHGNNG
sp|O51066|PARC_BORBU      SLFEM---HDGNFHKVANVVGNTM-KYHPHGDTSIYEALVNIAN----KDLFIEKQGNFG
sp|O54478|PARC_CAUCR      AMSNMRLNPDAAARKCAKVVGEVMGNFHPHGDASIYDALVRLAQEFSQRIPLVEGQGNFG
sp|P0AFI3|PARC_ECO57      AMSELGLNASAKFKKSARTVGDVLGKYHPHGDSACYEAMVLMAQPFSYRYPLVDGQGNWG
sp|P0AFI2|PARC_ECOLI      AMSELGLNASAKFKKSARTVGDVLGKYHPHGDSACYEAMVLMAQPFSYRYPLVDGQGNWG
                          ::       .   :* *..**:.: ::*****:: *:*:* :::    :  ::: :** *

sp|Q45066|PARC_BACSU      SIDG-DPPAAMRYTEARLSPIASELLRDIDKNTVEFVPNFDDTSKEPVVLPAMFPNLLVN
sp|O51066|PARC_BORBU      NLFTGDPASASRYIECRLTPLAFDVL--YSKEITIYESSYDGRNNEPLLYPAKIPVILIQ
sp|O54478|PARC_CAUCR      NIDG-DSAAAMRYTECKMTEAAMLLLDGIDEDAVDFRPTYDGQDEEPVVLPSGFPNLLAN
sp|P0AFI3|PARC_ECO57      APDDPKSFAAMRYTESRLSKYSELLLSELGQGTADWVPNFDGTLQEPKMLPARLPNILLN
sp|P0AFI2|PARC_ECOLI      APDDPKSFAAMRYTESRLSKYSELLLSELGQGTADWVPNFDGTLQEPKMLPARLPNILLN
                               .. :* ** *.:::  :  :*   .:  . : ..:*.  :** : *: :* :* :

sp|Q45066|PARC_BACSU      GSTGISAGYATDIPPHHLGEVIDAVIKRIQMPSCSVDELMEVIKGPDFPTGG-IIQGVDG
sp|O51066|PARC_BORBU      GSEGIAVGMAAKILPHNFNEILNAVKSELLGESYDIY--------PDFPTGGIVDVN---
sp|O54478|PARC_CAUCR      GSSGIAVGMATSIPPHNAAELIDACQLLLANPDATTADLLEKVPGPDFPTGGVIVESRAS
sp|P0AFI3|PARC_ECO57      GTTGIAVGMATDIPPHNLREVAQAAIALIDQPKTTLDQLLDIVQGPDYPTEAEIITSRAE
sp|P0AFI2|PARC_ECOLI      GTTGIAVGMATDIPPHNLREVAQAAIALIDQPKTTLDQLLDIVQGPDYPTEAEIITSRAE
                          *: **:.* *:.* **:  *: :*    :   .            **:** . :  .

sp|Q45066|PARC_BACSU      IRKAYETGKGKIIIRGKAEIETIRGGREQIVITEIPFEVNKANLVKKMDEFRIDKKVEGI
sp|O51066|PARC_BORBU      ---EYADGNGKVLVRAKIETIDEK----TIVIRELPFGETTESLISSIEKAIR-KNYIKV
sp|O54478|PARC_CAUCR      LLETYETGRGGVRMRAKWEKEDTGRGTYQIVVTEIPYQVKKSDLVEQLADLIDSKKAALL
sp|P0AFI3|PARC_ECO57      IRKIYENGRGSVRMRAVWKKEDG-----AVVISALPHQVSGARVLEQIAAQMRNKKLPMV
sp|P0AFI2|PARC_ECOLI      IRKIYENGRGSVRMRAVWKKEDG-----AVVISALPHQVSGARVLEQIAAQMRNKKLPMV
                              *  *.* : :*.  :          :*:  :*.  .   ::..:      *:   :

sp|Q45066|PARC_BACSU      SEVRDETDRTG--LRVVIELKKEADAKGILNFLYKNTDLQITYNFNMVAIHNR-RPMLMS
sp|O51066|PARC_BORBU      SSINDFTAEN---VAIELSLPRGVYASEVIEKLYHYTNCQISISVNLLLLSER-YPVVYT
sp|O54478|PARC_CAUCR      GDVRDESAED--IRLVLEPKSKNVEPEVLMESLFKLSALESRFPVNINVLDARGTPGVMG
sp|P0AFI3|PARC_ECO57      DDLRDESDHENPTRLVIVPRSNRVDMDQVMNHLFATTDLEKSYRINLNMIGLDGRPAVKN
sp|P0AFI2|PARC_ECOLI      DDLRDESDHENPTRLVIVPRSNRVDMDQVMNHLFATTDLEKSYRINLNMIGLDGRPAVKN
                          ..:.* : .      :     . .  . ::: *:  :  :    .*:  :     * :

sp|Q45066|PARC_BACSU      LPSILDAYIGHQKEVVTNRSVYELQKAKDRHHIVEGLMKALSILDEVIATIRSSSDKRDA
sp|O51066|PARC_BORBU      IKDLIKFHAAHLQKILKMELELQKSKILEKIFYKT--LEQIFIEKKIYKLLETISK----
sp|O54478|PARC_CAUCR      IKQALMAFLAHRREVLTRRARHRLAKIEARLHILDGLLIAYLNLDEVIRIVRYEDK---P
sp|P0AFI3|PARC_ECO57      LLEILSEWLVFRRDTVRRRLNYRLEKVLKRLHILEGLLVAFLNIDEVIEIIRNEDE---P
sp|P0AFI2|PARC_ECOLI      LLEILSEWLVFRRDTVRRRLNYRLEKVLKRLHILEGLLVAFLNIDEVIEIIRNEDE---P
                          : . :     . :. :  .   .  *   : .     :      .::   :.  ..

sp|Q45066|PARC_BACSU      KNNLIAKYEFTEPQAEAIVSLQLYRLTNTDITALKEEAEELGKKIEELESILSNDKKLLK
sp|O51066|PARC_BORBU      ----------EENILSIILSEVLRHKESFSREVLKEDVENLLKIPIRKISLFDIDKNSKD
sp|O54478|PARC_CAUCR      KEKLIETFGLTDIQADAILNTRLRQLAKLEEMEIRREHAELVEERDGILAMLASEAKQWK
sp|P0AFI3|PARC_ECO57      KPALMSRFGLTETQAEAILELKLRHLAKLEEMKIRGEQSELEKERDQLQGILASERKMNN
sp|P0AFI2|PARC_ECOLI      KPALMSRFGLTETQAEAILELKLRHLAKLEEMKIRGEQSELEKERDQLQGILASERKMNN
                                     :   . *:.  * :  . .   :: :  :* :      .::  : :  .

sp|Q45066|PARC_BACSU      VIT-------------NSLKALKKKYADTRRSVIEEK-IEEIKINLEVMVASEDVYVTVT
sp|O51066|PARC_BORBU      IKILN-----------KELKSINSNISSIRGYSINFIDLLLAKYSKEHQRKTK---ISLI
sp|O54478|PARC_CAUCR      LVGVGLSEVRAALLKIKHPLDKPRPTGVTGRSVFGEAPQVDADAAIEAMIVREPITIILS
sp|P0AFI3|PARC_ECO57      LL-------------KKELQADAQAYGDDRRSPLQEREEAKAMSEHD-MLPSEPVTIVLS
sp|P0AFI2|PARC_ECOLI      LL-------------KKELQADAQAYGDDRRSPLQEREEAKAMSEHD-MLPSEPVTIVLS
                          :               :         .      :            :     :   : :

sp|Q45066|PARC_BACSU      KDGYLKRTSQRSFAASNGQDFGMKDTDRMLHQFEMNTTDVLLLFTNKGSYIYCPVHQLPD
sp|O51066|PARC_BORBU      KSKNVKEIATKNMKVYLNLAEGFAGTSLFDGEFIGNASYYDKILVFRENSYVLKNIEDKT
sp|O54478|PARC_CAUCR      ERGWIRAAKGKIDDPS---ELKFKEGDKLGFLVPAETTDKLLIFSSDGRFFTLGCDKLPS
sp|P0AFI3|PARC_ECO57      QMGWVRSAKGHDIDAP---GLNYKAGDSFKAAVKGKSNQPVVFVDSTGRSYAIDPITLPS
sp|P0AFI2|PARC_ECOLI      QMGWVRSAKGHDIDAP---GLNYKAGDSFKAAVKGKSNQPVVFVDSTGRSYAIDPITLPS
                          :   ::    :               . :   .  ::.    :.

sp|Q45066|PARC_BACSU      IRWKDMGQHFSNLITIDRDETIVKAIPIKEFDPSAYLLFFTKNGMVKKTELTHYKAQRYS
sp|O51066|PARC_BORBU      FIDKKN-----------------------------------------------------V
sp|O54478|PARC_CAUCR      ARGHGE------------------------------------------------------
sp|P0AFI3|PARC_ECO57      ARGQGE------------------------------------------------------
sp|P0AFI2|PARC_ECOLI      ARGQGE------------------------------------------------------
                             :

sp|Q45066|PARC_BACSU      KALVALNLKGEDELIDVHVTNGESQIFMATHLGYGLWFGEDEVNVVGARAAGVKGINLKE
sp|O51066|PARC_BORBU      CALVYDINNSKEQIFSIIYFNRLDNFYYVKRFKIDKFITDKVYEFLGENDE---------
sp|O54478|PARC_CAUCR      PVRMMIELDDKVKIIDVFPFKAGRKRILASKGGYGFLMPEE--EALANRKAGKQVLNVGN
sp|P0AFI3|PARC_ECO57      PLTGKLTLPPGATVDHMLMESDDQKLLMASDAGYGFVCTFN--DLVARNRAGKALITLPE
sp|P0AFI2|PARC_ECOLI      PLTGKLTLPPGATVDHMLMESDDQKLLMASDAGYGFVCTFN--DLVARNRAGKALITLPE
                                       :  :   .   :   ..    .     .  : :. .

sp|Q45066|PARC_BACSU      DDFVVSGEILQQS-DSIVLFTQRGAVKRMSLSEFEKTSRAKRGVVMLR---------ELK
sp|O51066|PARC_BORBU      ------FVDFSLNPEFVEFSTNKDIVKRIEIDNFMVKSRSSIG-----------------
sp|O54478|PARC_CAUCR      EG---AAFCLEAVGDQLAVIGDNGKILIFPLEELPEMPRGKG----VKLQ-------AYR
sp|P0AFI3|PARC_ECO57      NAHVMPPVVIEDASDMLLAITQAGRMLMFPVSDLPQLSKGKG----NKIINIPSAEAARG
sp|P0AFI2|PARC_ECOLI      NAHVMPPVVIEDASDMLLAITQAGRMLMFPVSDLPQLSKGKG----NKIINIPSAEAARG
                                   :.   : :    : . :  : :.::   .:..

sp|Q45066|PARC_BACSU      KNPHRVVALFACGLEQRLMAETEKGDRKELQTKELRTNDRYSNGSFFFDEEESGKVTAVW
sp|O51066|PARC_BORBU      ---------------------------------KRISSNNLKKVKFK-------------
sp|O54478|PARC_CAUCR      EGGLRDGLSFNAETGAYWIDTAGR--RRDWAEWKEWVGRRAGAGKLVP-------KGFAT
sp|P0AFI3|PARC_ECO57      EDGLAQLYVLPPQS-TLTIHVGKRKIKLRPEELQKVTGERGRRGTLMRGLQRIDRVEIDS
sp|P0AFI2|PARC_ECOLI      EDGLAQLYVLPPQS-TLTIHVGKRKIKLRPEELQKVTGERGRRGTLMRGLQRIDRVEIDS
                                                           :   . .    .:

sp|Q45066|PARC_BACSU      RLHTEQ-----
sp|O51066|PARC_BORBU      -----------
sp|O54478|PARC_CAUCR      NKRFRPK----
sp|P0AFI3|PARC_ECO57      PRRASSGDSEE
sp|P0AFI2|PARC_ECOLI      PRRASSGDSEE

This is great! I don’t have to be bothered with the piles of fasta files anymore. All I need is this script.

Blasting EBI Services 1

Posted on Updated on

Although I knew there were some programtical way to access EMBL-EBI databases(http://www.ebi.ac.uk/Tools/webservices/), I had no idea how to do it because I didn’t know about UDDI, WSDL, and SOAP.
After learning the concepts of those technologies I tried some examples in this page (http://www.ebi.ac.uk/Tools/webservices/tutorials/06_programming/python/soap/soappy). The first two example code worked without any modifications. But the third one, “Complex Data Structures”, didn’t work.

The cause of this problem was that the url(http://www.ebi.ac.uk/Tools/webservices/wsdl/WSWUBlast.wsdl) used in the code was not accessible anymore. The right url was “http://www.ebi.ac.uk/Tools/services/soap/fasta?wsdl”. Furthermore, the way to access the service had changed accordingly.

After some struggles, I modified the code as follows.

# Import WSDL package
from SOAPpy import WSDL

# Create service interface
wsdlUrl = 'http://www.ebi.ac.uk/Tools/services/soap/fasta?wsdl'
server = WSDL.Proxy(wsdlUrl)

# For debug
server.soapproxy.config.dumpSOAPOut = 1
server.soapproxy.config.dumpSOAPIn = 1

# Query sequence
seq = """>Q8E5Q5_STRA3
MKLSKRYRFWQKVIKALGVLALIATLVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFI
VAQIVQIVFPVIPGGVTTVAGFLIFGPTLGFIYNYIGIIIGSVILFWLVKFYGRKFVLLF
MDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPI
SIIGYSYLWIYGGDILKNFLN"""

dbnames = {'string':'pdb'}

# Structure containing parameters
blast_params = {
  'program':'fasta',      # Protein vs. protein search
  'database':dbnames,  # Database to search
  'scores':10,
  'alignments':10,
  'stype':'protein',
  'sequence':seq
}

server.methods['run'].namespace = server.wsdl.targetNamespace
jobid = server.run(email="pachiras@localhost", title="testjob", parameters=blast_params)

The point was to specify the namespace before calling each method. Since server.wsdl.targetNamespace holds the namespace written in the wsdl file( ‘http://www.ebi.ac.uk/Tools/services/soap/fasta?wsdl’), its value is substituted into server.method[‘some_method’].namespace.

With suds library, this code is rewritten in a more simple way.

from suds.client import Client

wsdlUrl = 'http://www.ebi.ac.uk/Tools/services/soap/fasta?wsdl'
client = Client(wsdlUrl)

seq = """MKLSKRYRFWQKVIKALGVLALIATLVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFI
VAQIVQIVFPVIPGGVTTVAGFLIFGPTLGFIYNYIGIIIGSVILFWLVKFYGRKFVLLF
MDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPI
SIIGYSYLWIYGGDILKNFLN"""

dbnames = {'string':'pdb'}

# Structure containing parameters
blast_params = {
  'program':'fasta',      # Protein vs. protein search
  'database':dbnames,  # Database to search
  'scores':10,
  'alignments':10,
  'stype':'protein',
  'sequence':seq
}

jobid = client.service.run(email="pachiras@localhost", title="testjob", parameters=blast_params)

If you’d like to know what type of methods the service is providing, simply type this.

print client

 

Suds ( https://fedorahosted.org/suds/ )  version: 0.4.1 (beta)  build: R703-20101015

Service ( JDispatcherService ) tns="http://soap.jdispatcher.ebi.ac.uk"
   Prefixes (1)
      ns0 = "http://soap.jdispatcher.ebi.ac.uk"
   Ports (1):
      (JDispatcherServiceHttpPort)
         Methods (6):
            getParameterDetails(xs:string parameterId, )
            getParameters()
            getResult(xs:string jobId, xs:string type, wsRawOutputParameters parameters, )
            getResultTypes(xs:string jobId, )
            getStatus(xs:string jobId, )
            run(xs:string email, xs:string title, InputParameters parameters, )
         Types (12):
            ArrayOfString
            InputParameters
            wsParameterDetails
            wsParameterValue
            wsParameterValues
            wsParameters
            wsProperties
            wsProperty
            wsRawOutputParameter
            wsRawOutputParameters
            wsResultType
            wsResultTypes

In those method getResultTypes() will give you the information about the format in which your result will be returned. So type this.

print client.service.getResultTypes(jobid)

 

(wsResultTypes){
   type[] =
      (wsResultType){
         description = "The output from the tool itself"
         fileSuffix = "txt"
         identifier = "out"
         label = "Tool Output"
         mediaType = "text/plain"
      },
      (wsResultType){
         description = "The M9 output from the tool itself"
         fileSuffix = "txt"
         identifier = "m9"
         label = "Tool M9 Output"
         mediaType = "text/plain"
      },
      (wsResultType){
         description = "Your input sequence as seen by the tool"
         fileSuffix = "txt"
         identifier = "sequence"
         label = "Input Sequence"
         mediaType = "text/plain"
      },
      (wsResultType){
         description = "The list of the alignment identifiers in the order they appear in the results"
         fileSuffix = "txt"
         identifier = "ids"
         label = "Alignment Identifiers"
         mediaType = "text/plain"
      },
      (wsResultType){
         description = "The results of the job in XML (EBI format)"
         fileSuffix = "xml"
         identifier = "xml"
         label = "XML Result"
         mediaType = "application/xml"
      },
      (wsResultType){
         description = "A visual representation of the results in SVG format"
         fileSuffix = "svg"
         identifier = "visual-svg"
         label = "Visual Result (SVG)"
         mediaType = "image/svg+xml"
      },
      (wsResultType){
         description = "A visual representation of the results in PNG format"
         fileSuffix = "png"
         identifier = "visual-png"
         label = "Visual Result (PNG)"
         mediaType = "image/png"
      },
      (wsResultType){
         description = "A visual representation of the results in JPEG format"
         fileSuffix = "jpg"
         identifier = "visual-jpg"
         label = "Visual Result (JPEG)"
         mediaType = "image/jpeg"
      },
 }

It takes some time until you get the result. So you have to wail and poll as in http://www.ebi.ac.uk/Tools/webservices/tutorials/06_programming/python/soap/soappy

import base64

status = 'PENDING'
while status == 'RUNNING' or status == 'PENDING':
    status = client.service.getStatus(jobid)
    print status
    if status == 'RUNNING' or status == 'PENDING':
        time.sleep(15)

result = client.service.getResult(jobid, "xml")
dec = base64.b64decode( result )

print dec
FINISHED
<?xml version='1.0' encoding='ISO-8859-1'?>
<EBIApplicationResult xmlns="http://www.ebi.ac.uk/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.ebi.ac.uk/Tools/common/schema/ApplicationResult.xsd">
 <Header>
  <program name="FASTA" version="36.3.5e Nov, 2012(preload8)"/>
  <commandLine command="cat fasta-S20130413-060528-0619-72925383-pg.sequence | /nfs/public/ro/es/appbin/linux-x86_64/fasta-36.3.5e-20121123/fasta36 -l $IDATA_CURRENT/fastacfg/fasta3db -L -T 8  -p -b 10 -d 10 -m &quot;F9 fasta-S20130413-060528-0619-72925383-pg.m9&quot; \@:1- +pdb+"/>
  <parameters>
   <sequences total="1">
    <sequence number="1" name="EMBOSS_001" type="PROTEIN" length="201"/>
   </sequences>
   <databases total="1" sequences="-1" letters="-1">
    <database number="1" name="pdb" type="NUCLEOTIDE"/>
   </databases>
   <alignments>10</alignments>
   <scores>10</scores>
  </parameters>
  <timeInfo start="2013-04-13T06:05:28+01:00" end="2013-04-13T06:05:32+01:00" search="P0Y0M0DT0H0M4.000S"/>
 </Header>
 <SequenceSimilaritySearchResult>
  <hits total="10">
   <hit number="1" database="PDB" id="4HYC_F" length="301" description="mol:protein length:301  Putative uncharacterized protein">
    <alignments total="1">
     <alignment number="1">
      <score>92</score>
      <smithWatermanScore>92</smithWatermanScore>
      <initn>43</initn>
      <init1>43</init1>
      <opt>92</opt>
      <zScore>128.8</zScore>
      <bits>31.3</bits>
      <expectation>5.0</expectation>
      <identity>27.0</identity>
      <positives>66.2</positives>
      <overlap>74</overlap>
      <strand>none/plus</strand>
      <querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
      <pattern>                              :.... .::. ::. :.:     .::.. : .:. ..    .::.... ...:. :..  ::.:.  :.    :                                                                            </pattern>
      <matchSeq start="13" end="82">                  MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
     </alignment>
    </alignments>
   </hit>
   <hit number="2" database="PDB" id="4HYC_E" length="301" description="mol:protein length:301  Putative uncharacterized protein">
    <alignments total="1">
     <alignment number="1">
      <score>92</score>
      <smithWatermanScore>92</smithWatermanScore>
      <initn>43</initn>
      <init1>43</init1>
      <opt>92</opt>
      <zScore>128.8</zScore>
      <bits>31.3</bits>
      <expectation>5.0</expectation>
      <identity>27.0</identity>
      <positives>66.2</positives>
      <overlap>74</overlap>
      <strand>none/plus</strand>
      <querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
      <pattern>                              :.... .::. ::. :.:     .::.. : .:. ..    .::.... ...:. :..  ::.:.  :.    :                                                                            </pattern>
      <matchSeq start="13" end="82">                  MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
     </alignment>
    </alignments>
   </hit>
   <hit number="3" database="PDB" id="4HYD_B" length="301" description="mol:protein length:301  Putative uncharacterized protein">
    <alignments total="1">
     <alignment number="1">
      <score>92</score>
      <smithWatermanScore>92</smithWatermanScore>
      <initn>43</initn>
      <init1>43</init1>
      <opt>92</opt>
      <zScore>128.8</zScore>
      <bits>31.3</bits>
      <expectation>5.0</expectation>
      <identity>27.0</identity>
      <positives>66.2</positives>
      <overlap>74</overlap>
      <strand>none/plus</strand>
      <querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
      <pattern>                              :.... .::. ::. :.:     .::.. : .:. ..    .::.... ...:. :..  ::.:.  :.    :                                                                            </pattern>
      <matchSeq start="13" end="82">                  MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
     </alignment>
    </alignments>
   </hit>
   <hit number="4" database="PDB" id="4HYG_C" length="301" description="mol:protein length:301  Putative uncharacterized protein">
    <alignments total="1">
     <alignment number="1">
      <score>92</score>
      <smithWatermanScore>92</smithWatermanScore>
      <initn>43</initn>
      <init1>43</init1>
      <opt>92</opt>
      <zScore>128.8</zScore>
      <bits>31.3</bits>
      <expectation>5.0</expectation>
      <identity>27.0</identity>
      <positives>66.2</positives>
      <overlap>74</overlap>
      <strand>none/plus</strand>
      <querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
      <pattern>                              :.... .::. ::. :.:     .::.. : .:. ..    .::.... ...:. :..  ::.:.  :.    :                                                                            </pattern>
      <matchSeq start="13" end="82">                  MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
     </alignment>
    </alignments>
   </hit>
   <hit number="5" database="PDB" id="4HYD_D" length="301" description="mol:protein length:301  Putative uncharacterized protein">
    <alignments total="1">
     <alignment number="1">
      <score>92</score>
      <smithWatermanScore>92</smithWatermanScore>
      <initn>43</initn>
      <init1>43</init1>
      <opt>92</opt>
      <zScore>128.8</zScore>
      <bits>31.3</bits>
      <expectation>5.0</expectation>
      <identity>27.0</identity>
      <positives>66.2</positives>
      <overlap>74</overlap>
      <strand>none/plus</strand>
      <querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
      <pattern>                              :.... .::. ::. :.:     .::.. : .:. ..    .::.... ...:. :..  ::.:.  :.    :                                                                            </pattern>
      <matchSeq start="13" end="82">                  MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
     </alignment>
    </alignments>
   </hit>
   <hit number="6" database="PDB" id="4HYD_C" length="301" description="mol:protein length:301  Putative uncharacterized protein">
    <alignments total="1">
     <alignment number="1">
      <score>92</score>
      <smithWatermanScore>92</smithWatermanScore>
      <initn>43</initn>
      <init1>43</init1>
      <opt>92</opt>
      <zScore>128.8</zScore>
      <bits>31.3</bits>
      <expectation>5.0</expectation>
      <identity>27.0</identity>
      <positives>66.2</positives>
      <overlap>74</overlap>
      <strand>none/plus</strand>
      <querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
      <pattern>                              :.... .::. ::. :.:     .::.. : .:. ..    .::.... ...:. :..  ::.:.  :.    :                                                                            </pattern>
      <matchSeq start="13" end="82">                  MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
     </alignment>
    </alignments>
   </hit>
   <hit number="7" database="PDB" id="4HYC_G" length="301" description="mol:protein length:301  Putative uncharacterized protein">
    <alignments total="1">
     <alignment number="1">
      <score>92</score>
      <smithWatermanScore>92</smithWatermanScore>
      <initn>43</initn>
      <init1>43</init1>
      <opt>92</opt>
      <zScore>128.8</zScore>
      <bits>31.3</bits>
      <expectation>5.0</expectation>
      <identity>27.0</identity>
      <positives>66.2</positives>
      <overlap>74</overlap>
      <strand>none/plus</strand>
      <querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
      <pattern>                              :.... .::. ::. :.:     .::.. : .:. ..    .::.... ...:. :..  ::.:.  :.    :                                                                            </pattern>
      <matchSeq start="13" end="82">                  MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
     </alignment>
    </alignments>
   </hit>
   <hit number="8" database="PDB" id="4HYD_A" length="301" description="mol:protein length:301  Putative uncharacterized protein">
    <alignments total="1">
     <alignment number="1">
      <score>92</score>
      <smithWatermanScore>92</smithWatermanScore>
      <initn>43</initn>
      <init1>43</init1>
      <opt>92</opt>
      <zScore>128.8</zScore>
      <bits>31.3</bits>
      <expectation>5.0</expectation>
      <identity>27.0</identity>
      <positives>66.2</positives>
      <overlap>74</overlap>
      <strand>none/plus</strand>
      <querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
      <pattern>                              :.... .::. ::. :.:     .::.. : .:. ..    .::.... ...:. :..  ::.:.  :.    :                                                                            </pattern>
      <matchSeq start="13" end="82">                  MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
     </alignment>
    </alignments>
   </hit>
   <hit number="9" database="PDB" id="4HYC_B" length="301" description="mol:protein length:301  Putative uncharacterized protein">
    <alignments total="1">
     <alignment number="1">
      <score>92</score>
      <smithWatermanScore>92</smithWatermanScore>
      <initn>43</initn>
      <init1>43</init1>
      <opt>92</opt>
      <zScore>128.8</zScore>
      <bits>31.3</bits>
      <expectation>5.0</expectation>
      <identity>27.0</identity>
      <positives>66.2</positives>
      <overlap>74</overlap>
      <strand>none/plus</strand>
      <querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
      <pattern>                              :.... .::. ::. :.:     .::.. : .:. ..    .::.... ...:. :..  ::.:.  :.    :                                                                            </pattern>
      <matchSeq start="13" end="82">                  MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
     </alignment>
    </alignments>
   </hit>
   <hit number="10" database="PDB" id="4HYC_D" length="301" description="mol:protein length:301  Putative uncharacterized protein">
    <alignments total="1">
     <alignment number="1">
      <score>92</score>
      <smithWatermanScore>92</smithWatermanScore>
      <initn>43</initn>
      <init1>43</init1>
      <opt>92</opt>
      <zScore>128.8</zScore>
      <bits>31.3</bits>
      <expectation>5.0</expectation>
      <identity>27.0</identity>
      <positives>66.2</positives>
      <overlap>74</overlap>
      <strand>none/plus</strand>
      <querySeq start="56" end="126">LVLVVYLYKLGILNDSNELKDLVHKYEFWGPMIFIVAQIVQIVFPVIPGGVTTVAGFLIF-GPT-LGFIYNYIGIIIG-SVILFWLVKFYGRKFVLLFMDQKTFDKYESKLETSGYEKFFIFCMASPISPADIMVMITGLSNMSIKRFVTIIMITKPISIIGYSYLWIYGGDILKNFLN </querySeq>
      <pattern>                              :.... .::. ::. :.:     .::.. : .:. ..    .::.... ...:. :..  ::.:.  :.    :                                                                            </pattern>
      <matchSeq start="13" end="82">                  MQIRDWLPLLGMPLMLLFVQIIAIVL-VMP---MQAAGLVAFENPSSVANPLIFIGMLLAFTLVLLVLLRTGGRRFIAAFIGFALFMTFLYIFGALSLLALGPTTAAAAGTLIGAVAVTALLYLYPEWYVIDILGVLISAGVASIFGISLEPLPVLVLLVLL</matchSeq>
     </alignment>
    </alignments>
   </hit>
  </hits>
 </SequenceSimilaritySearchResult></EBIApplicationResult>

I chose “xml” format. The result must be decoded with base64.

Accessing Uniprot

Posted on Updated on

With urllib library you can access Uniprot(http://www.uniprot.org/) in very straightforward way.

import urllib

args = {
'query':'gene:ParC active:yes reviewed:yes',
'format':'xml',
'offset':0,
'limit':15
}
encoded_args = urllib.urlencode( args )

url = "http://www.uniprot.org/uniprot/"

xml_data   = urllib.urlopen( url, data = encoded_args )

When I first saw the encoded string, I thought something was wrong because it included some weird characters.

print encoded_args

query=gene%3AParC+active%3Ayes+reviewed%3Ayes&offset=0&limit=15&format=xml

But I found that was OK. The special characters were expressed in unicode, like, ‘%3A’ = ‘:’. Those encoded strings are transformed into human-readable form with urllib.unquote().

print urllib.unquote(encoded_args)

query=gene:ParC+active:yes+reviewed:yes&offset=0&limit=15&format=xml

The meanings of arguments are written in this page(http://www.uniprot.org/faq/28).

Once you get the xml data you can extract any information from that as you like.

from xml.dom import minidom

dom = minidom.parse( xml_data )
entries = dom.getElementsByTagName("entry")

for entry in entries:
    # entry name
    ent_name = entry.getElementsByTagName("name")
    print ent_name[0].childNodes[0].data

    # Organism
    orgs = entry.getElementsByTagName("organism")
    org_names = orgs[0].getElementsByTagName("name")

    sciname = commname = ""
    for org_name in org_names:
        t = org_name.getAttribute("type")
        if "scientific" == t:
            sciname = org_name.childNodes[0].data
        elif "common" == t:
            commname = org_name.childNodes[0].data

    if len(commname)>0:
        sciname += ' (' + commname + ')'

    print "- ",sciname

    # PDB data
    pdbids = []
    dbrefs = entry.getElementsByTagName("dbReference")
    for dbref in dbrefs:
        t = dbref.getAttribute("type")
        if "PDB" == t:
            pdbids.append( dbref.getAttribute("id") )

    if len(pdbids)>0:
        print "-  PDBID: ",",".join(pdbids)

CCL18_HUMAN
– Homo sapiens (Human)
CDP1_ARATH
– Arabidopsis thaliana (Mouse-ear cress)
CUL9_HUMAN
– Homo sapiens (Human)
– PDBID: 2JUF
CUL9_MOUSE
– Mus musculus (Mouse)
DOIS_STRRY
– Streptomyces rimosus subsp. paromomycinus
GSTXC_TOBAC
– Nicotiana tabacum (Common tobacco)
ORC1_HUMAN
– Homo sapiens (Human)
PARC4_ECOLX
– Escherichia coli
PARC_BACSU
– Bacillus subtilis (strain 168)
PARC_BORBU
– Borrelia burgdorferi (strain ATCC 35210 / B31 / CIP 102532 / DSM 4680)
PARC_CAUCR
– Caulobacter crescentus (strain ATCC 19089 / CB15)
PARC_CROSK
– Cronobacter sakazakii
PARC_ECO57
– Escherichia coli O157:H7
PARC_ECOLI
– Escherichia coli (strain K12)
– PDBID: 1ZVT,1ZVU
PARC_HAEIN
– Haemophilus influenzae (strain ATCC 51907 / DSM 11121 / KW20 / Rd)

Posting queries to RCSB PDB

Posted on Updated on

The RCSB PDB is providing RESTful Web Service(http://www.rcsb.org/pdb/software/rest.do). You can execute advanced search by sending XML data. Some XML examples are shown by selecting query types from the pull-down at the bottom of the page. The page also shows an example code written by python (http://www.rcsb.org/pdb/software/static.do?p=/software/webservices/search_nmr.jsp). Although the code is written with urllib2, I rewrote the code with httplib2 so that I could keep the consistency with the previous code (And another reason was that urllib2 was felt difficult for me to handle the URL redirection ). If you use httplib2, you have to specify the http header strictly.

import httplib2

def get_pdbidlist_bykeyword( keyword ):
    url = "http://www.rcsb.org/pdb/rest/search"

    # XML block copy and pasted from RCSB PDB
    # No indentation is allowed
    queryText = """
<?xml version="1.0" encoding="UTF-8"?>
<orgPdbQuery>
<queryType>org.pdb.query.simple.AdvancedKeywordQuery</queryType>
<description>Text Search</description>
<keywords>%s</keywords>
</orgPdbQuery>
""" % keyword

    q_header = {'Content-type': 'application/x-www-form-urlencoded'}
    conn = httplib2.Http()
    (response, body) = conn.request( url, "POST", body = queryText, headers=q_header)

    pdbid_list = []
    http_status = response["status"]
    if( http_status == "200"):
        pdbid_list = body.strip().split('\n')
    else:
        print "HTTP Response: ", http_status

    return pdbid_list

Here is an usage example of this function.

pdbid_list = get_pdbidlist_bykeyword( "gyrA" )

for pid in pdbid_list:
    print '>', pid

> 1AB4
> 1AM2
> 1SUU
> 1X75
> 1ZI0
> 1ZVU
> 2BH8
> 2XCO
> 2XCQ
> 2XCR
> 2XCS
> 2XCT
> 2Y3P
> 2ZJT
> 3G7Z
> 3IFZ
> 3ILW
> 3L6V
> 3LPX
> 3NO0
> 3NUH
> 3TTZ
> 3UC1
> 4DDQ
> 4ELY
> 4ELZ

With download_pdbfile() function written in here(https://pachiras.wordpress.com/2013/03/24/downloading-pdb-files/) you can download all the pdb files related to the keyword you specify.

Downloading PDB files

Posted on Updated on

Downloading PDB files is one of my daily tasks. So I wrote this to make it easier.

import httplib2, StringIO, gzip

def download_pdbfile( pdbid ):
    if(len(pdbid) != 4 ):
        print "Invalid id: ", pdbid
        return

    url = 'http://www.rcsb.org/pdb/files/%s.pdb.gz' % pdbid
    conn = httplib2.Http()
    (response, body) = conn.request( url )
    enc_body = StringIO.StringIO( body )
    content = gzip.GzipFile( fileobj= enc_body ).read()

	local_file = open(pdbid+'.pdb','wb')
	local_file.write( content )
	local_file.close()