
Source Code for Module Bio.EUtils.ThinClient

   1  """Low-level interface to NCBI's EUtils for Entrez search and retrieval. 
   2   
   3  For higher-level interfaces, see DBIdsClient (which works with a set 
   4  of database identifiers) and HistoryClient (which does a much better 
   5  job of handling history). 
   6   
   7  There are five classes of services: 
   8    ESearch - search a database 
   9    EPost - upload a list of indices for further use 
  10    ESummary - get document summaries for a given set of records 
  11    EFetch - get the records translated to a given format 
  12    ELink - find related records in other databases 
  13   
  14  You can find more information about them at 
  15    http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html 
  16  but that document isn't very useful.  Perhaps the following is better. 
  17   
  18  EUtils offers a structured way to query Entrez, get the results in 
  19  various formats, and get information about related documents.  The way 
  20  to start off is to create a ThinClient object. 
  21   
  22  >>> from Bio import EUtils 
  23  >>> from Bio.EUtils.ThinClient import ThinClient 
  24  >>> eutils = ThinClient.ThinClient() 
  25  >>>  
  26   
  27  You can search Entrez with the "esearch" method.  This does a query on 
  28  the server, which generates a list of identifiers for records that 
  29  matched the query.  However, not all the identifiers are returned. 
  30  You can request only a subset of the matches (using the 'retstart' and 
  31  'retmax' terms).  This is useful because searches like 'cancer' can 
  32  have over 1.4 million matches.  Most people would rather change the 
  33  query or look at more details about the first few hits than wait to 
  34  download all the identifiers before doing anything else. 
  35   
  36  The esearch method, and indeed all these methods, returns a 
  37  'urllib.addinfourl', a file-like object wrapping the HTTP response 
  38  with the headers already parsed and ready to read the data from the 
  39  server. 
  40   
  41  For example, here's a query and how to use it 
  42   
  43    Search in PubMed for the term cancer for the entrez date from the 
  44    last 60 days and retrieve the first 10 IDs and translations using 
  45    the history parameter. 
  46   
  47  >>> infile = eutils.esearch("cancer", 
  48  ...                         daterange = EUtils.WithinNDays(60, "edat"), 
  49  ...                         retmax = 10) 
  50  >>> 
  51  >>> print infile.read() 
  52  <?xml version="1.0"?> 
  53  <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd"> 
  54  <eSearchResult> 
  55          <Count>7228</Count> 
  56          <RetMax>10</RetMax> 
  57          <RetStart>0</RetStart> 
  58          <IdList> 
  59                  <Id>12503096</Id> 
  60                  <Id>12503075</Id> 
  61                  <Id>12503073</Id> 
  62                  <Id>12503033</Id> 
  63                  <Id>12503030</Id> 
  64                  <Id>12503028</Id> 
  65                  <Id>12502932</Id> 
  66                  <Id>12502925</Id> 
  67                  <Id>12502881</Id> 
  68                  <Id>12502872</Id> 
  69          </IdList> 
  70          <TranslationSet> 
  71                  <Translation> 
  72                          <From>cancer%5BAll+Fields%5D</From> 
  73                          <To>(%22neoplasms%22%5BMeSH+Terms%5D+OR+cancer%5BText+Word%5D)</To> 
  74                  </Translation> 
  75          </TranslationSet> 
  76          <TranslationStack> 
  77                  <TermSet> 
  78                          <Term>"neoplasms"[MeSH Terms]</Term> 
  79                          <Field>MeSH Terms</Field> 
  80                          <Count>1407151</Count> 
  81                          <Explode>Y</Explode> 
  82                  </TermSet> 
  83                  <TermSet> 
  84                          <Term>cancer[Text Word]</Term> 
  85                          <Field>Text Word</Field> 
  86                          <Count>382919</Count> 
  87                          <Explode>Y</Explode> 
  88                  </TermSet> 
  89                  <OP>OR</OP> 
  90                  <TermSet> 
  91                          <Term>2002/10/30[edat]</Term> 
  92                          <Field>edat</Field> 
  93                          <Count>-1</Count> 
  94                          <Explode>Y</Explode> 
  95                  </TermSet> 
  96                  <TermSet> 
  97                          <Term>2002/12/29[edat]</Term> 
  98                          <Field>edat</Field> 
  99                          <Count>-1</Count> 
 100                          <Explode>Y</Explode> 
 101                  </TermSet> 
 102                  <OP>RANGE</OP> 
 103                  <OP>AND</OP> 
 104          </TranslationStack> 
 105  </eSearchResult> 
 106   
 107  >>> 
 108   
 109  You get a raw XML input stream which you can process in many ways. 
  110  (The appropriate DTDs are included in the subdirectory "DTDs"; see 
  111  also the included POM reading code.) 
 112   
 113      WARNING! As of this writing (2002/12/3) NCBI returns their 
 114      XML encoded as Latin-1 but their processing instruction says 
 115      it is UTF-8 because they leave out the "encoding" attribute. 
 116      Until they fix it you will need to recode the input stream 
 117      before processing it with XML tools, like this 
 118   
 119          import codecs 
 120          infile = codecs.EncodedFile(infile, "utf-8", "iso-8859-1") 
 121   
 122   
 123  The XML fields are mostly understandable: 
 124    Count -- the total number of matches from this search 
 125    RetMax -- the number of <ID> values returned in this subset 
 126    RetStart -- the start position of this subset in the list of 
 127        all matches 
 128   
 129    IDList and ID -- the identifiers in this subset 
 130   
 131    TranslationSet / Translation -- if the search field is not 
 132        explicitly specified ("qualified"), then the server will 
  133        apply a set of heuristics to improve the query.  Eg, in 
 134        this case "cancer" is first parsed as 
 135          cancer[All Fields] 
 136        then turned into the query 
 137          "neoplasms"[MeSH Terms] OR cancer[Text Word] 
 138   
 139        Note that these terms are URL escaped. 
 140        For details on how the translation is done, see 
 141  http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#AutomaticTermMapping 
 142   
  143    TranslationStack -- The (possibly 'improved') query fully 
 144        parsed out and converted into a postfix (RPN) notation. 
 145        The above example is written in the Entrez query language as 
 146   
 147          ("neoplasms"[MeSH Terms] OR cancer[Text Word]) AND 
 148                       2002/10/30:2002/12/29[edat] 
 149        Note that these terms are *not* URL escaped.  Nothing like 
 150        a bit of inconsistency for the soul. 
 151   
 152        The "Count" field shows how many matches were found for each 
 153        term of the expression.  I don't know what "Explode" does. 
 154   
 155   
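The postfix TranslationStack can be folded back into the infix form shown above with a small stack evaluator.  This is only an illustrative sketch, not part of this module; the 'stack_to_infix' helper is hypothetical and assumes the stack has already been reduced to a list of term strings and operator tokens:

```python
def stack_to_infix(tokens):
    """Convert an Entrez TranslationStack (postfix) to an infix query string.

    Hypothetical helper, not part of ThinClient.  'tokens' is a list of
    term strings and the operators "AND", "OR" and "RANGE".  RANGE joins
    two date terms sharing a field, e.g. 2002/10/30:2002/12/29[edat].
    """
    stack = []
    for tok in tokens:
        if tok in ("AND", "OR"):
            right = stack.pop()
            left = stack.pop()
            stack.append("(%s %s %s)" % (left, tok, right))
        elif tok == "RANGE":
            right = stack.pop()
            left = stack.pop()
            # Both terms carry the same field tag; keep only the second one.
            stack.append(left.split("[")[0] + ":" + right)
        else:
            stack.append(tok)  # a plain TermSet entry
    assert len(stack) == 1, "a well-formed stack reduces to one expression"
    return stack[0]

query = stack_to_infix([
    '"neoplasms"[MeSH Terms]',
    'cancer[Text Word]',
    "OR",
    "2002/10/30[edat]",
    "2002/12/29[edat]",
    "RANGE",
    "AND",
])
```

Run on the stack from the example above, this reproduces the query
("neoplasms"[MeSH Terms] OR cancer[Text Word]) AND 2002/10/30:2002/12/29[edat],
modulo the outermost parentheses.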
 156  Let's get more information about the first record, which has an id of 
  157  12503096.  There are two ways to query for information: one uses a set 
  158  of identifiers and the other uses the history.  I'll talk about the 
  159  history one in a bit.  To use a set of identifiers you need to make a 
  160  DBIds object containing that list. 
 161   
 162  >>> dbids = EUtils.DBIds("pubmed", ["12503096"]) 
 163  >>> 
 164   
 165  Now get the summary using dbids 
 166   
 167  >>> infile = eutils.esummary_using_dbids(dbids) 
 168  >>> print infile.read() 
 169  <?xml version="1.0"?> 
 170  <!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD eSummaryResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_020511.dtd"> 
 171  <eSummaryResult> 
 172  <DocSum> 
 173          <Id>12503096</Id> 
 174          <Item Name="PubDate" Type="Date">2003 Jan 30</Item> 
 175          <Item Name="Source" Type="String">Am J Med Genet</Item> 
 176          <Item Name="Authors" Type="String">Coyne JC, Kruus L, Racioppo M, Calzone KA, Armstrong K</Item> 
 177          <Item Name="Title" Type="String">What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?</Item> 
 178          <Item Name="Volume" Type="String">116</Item> 
 179          <Item Name="Pages" Type="String">222-8</Item> 
 180          <Item Name="EntrezDate" Type="Date">2002/12/28 04:00</Item> 
 181          <Item Name="PubMedId" Type="Integer">12503096</Item> 
 182          <Item Name="MedlineId" Type="Integer">22390532</Item> 
 183          <Item Name="Lang" Type="String">English</Item> 
 184          <Item Name="PubType" Type="String"></Item> 
 185          <Item Name="RecordStatus" Type="String">PubMed - in process</Item> 
 186          <Item Name="Issue" Type="String">3</Item> 
 187          <Item Name="SO" Type="String">2003 Jan 30;116(3):222-8</Item> 
 188          <Item Name="DOI" Type="String">10.1002/ajmg.a.10844</Item> 
 189          <Item Name="JTA" Type="String">3L4</Item> 
 190          <Item Name="ISSN" Type="String">0148-7299</Item> 
 191          <Item Name="PubId" Type="String"></Item> 
 192          <Item Name="PubStatus" Type="Integer">4</Item> 
 193          <Item Name="Status" Type="Integer">5</Item> 
 194          <Item Name="HasAbstract" Type="Integer">1</Item> 
 195          <Item Name="ArticleIds" Type="List"> 
 196                  <Item Name="PubMedId" Type="String">12503096</Item> 
 197                  <Item Name="DOI" Type="String">10.1002/ajmg.a.10844</Item> 
 198                  <Item Name="MedlineUID" Type="String">22390532</Item> 
 199          </Item> 
 200  </DocSum> 
 201  </eSummaryResult> 
 202  >>> 
 203   
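Because every top-level <Item> carries a Name attribute, a DocSum flattens naturally into a dictionary.  Here's a minimal sketch using the standard library's xml.etree ('docsum_to_dict' is a hypothetical helper, not part of this module), run against a trimmed copy of the output above:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the eSummaryResult shown above (DOCTYPE omitted).
DOCSUM_XML = """<eSummaryResult>
<DocSum>
        <Id>12503096</Id>
        <Item Name="PubDate" Type="Date">2003 Jan 30</Item>
        <Item Name="Source" Type="String">Am J Med Genet</Item>
        <Item Name="PubMedId" Type="Integer">12503096</Item>
</DocSum>
</eSummaryResult>"""

def docsum_to_dict(xml_text):
    """Flatten the top-level <Item> fields of the first DocSum into a dict.

    Hypothetical helper for illustration only; ignores nested List items.
    """
    docsum = ET.fromstring(xml_text).find("DocSum")
    fields = {"Id": docsum.findtext("Id")}
    for item in docsum.findall("Item"):
        fields[item.get("Name")] = item.text
    return fields

summary = docsum_to_dict(DOCSUM_XML)
```

Nested "List" items (like ArticleIds) would need one more level of recursion; this sketch keeps only the flat fields.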
 204  This is just a summary.  To get the full details, including an 
 205  abstract (if available) use the 'efetch' method.  I'll only print a 
 206  bit to convince you it has an abstract. 
 207   
 208  >>> s = eutils.efetch_using_dbids(dbids).read() 
 209  >>> print s[587:860] 
 210  <ArticleTitle>What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?</ArticleTitle> 
 211  <Pagination> 
 212  <MedlinePgn>222-8</MedlinePgn> 
 213  </Pagination> 
 214  <Abstract> 
 215  <AbstractText>Women recruited from a hereditary cancer registry provided 
 216  >>> 
 217   
 218  Suppose instead you want the data in a text format.  Different 
 219  databases have different text formats.  For example, PubMed has a 
 220  "docsum" format which gives just the summary of a document and 
 221  "medline" format as needed for a citation database.  To get these, use 
 222  a "text" "retmode" ("return mode") and select the appropriate 
 223  "rettype" ("return type"). 
 224   
 225  Here are examples of those two return types 
 226   
 227  >>> print eutils.efetch_using_dbids(dbids, "text", "docsum").read()[:497] 
 228  1:  Coyne JC, Kruus L, Racioppo M, Calzone KA, Armstrong K. 
 229  What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer? 
 230  Am J Med Genet. 2003 Jan 30;116(3):222-8. 
 231  PMID: 12503096 [PubMed - in process] 
 232  >>> print eutils.efetch_using_dbids(dbids, "text", "medline").read()[:369] 
 233  UI  - 22390532 
 234  PMID- 12503096 
 235  DA  - 20021227 
 236  IS  - 0148-7299 
 237  VI  - 116 
 238  IP  - 3 
 239  DP  - 2003 Jan 30 
 240  TI  - What do ratings of cancer-specific distress mean among women at high risk 
 241        of breast and ovarian cancer? 
 242  PG  - 222-8 
 243  AB  - Women recruited from a hereditary cancer registry provided ratings of 
 244        distress associated with different aspects of high-risk status 
 245  >>>  
 246   
 247  It's also possible to get a list of records related to a given 
 248  article.  This is done through the "elink" method.  For example, 
 249  here's how to get the list of PubMed articles related to the above 
 250  PubMed record.  (Again, truncated because otherwise there is a lot of 
 251  data.) 
 252   
 253  >>> print eutils.elink_using_dbids(dbids).read()[:590] 
 254  <?xml version="1.0"?> 
 255  <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd"> 
 256  <eLinkResult> 
 257  <LinkSet> 
 258          <DbFrom>pubmed</DbFrom> 
 259          <IdList> 
 260                  <Id>12503096</Id> 
 261          </IdList> 
 262          <LinkSetDb> 
 263                  <DbTo>pubmed</DbTo> 
 264                  <LinkName>pubmed_pubmed</LinkName> 
 265                  <Link> 
 266                          <Id>12503096</Id> 
 267                          <Score>2147483647</Score> 
 268                  </Link> 
 269                  <Link> 
 270                          <Id>11536413</Id> 
 271                          <Score>30817790</Score> 
 272                  </Link> 
 273                  <Link> 
 274                          <Id>11340606</Id> 
 275                          <Score>29939219</Score> 
 276                  </Link> 
 277                  <Link> 
 278                          <Id>10805955</Id> 
 279                          <Score>29584451</Score> 
 280                  </Link> 
 281  >>> 
 282   
 283  For a change of pace, let's work with the protein database to learn 
  284  how to work with history.  Suppose I want to do a multiple sequence 
 285  alignment of bacteriorhodopsin with all of its neighbors, where 
 286  "neighbors" is defined by NCBI.  There are good programs for this -- I 
 287  just need to get the records in the right format, like FASTA. 
 288   
 289  The bacteriorhodopsin I'm interested in is BAA75200, which is 
 290  GI:4579714, so I'll start by asking for its neighbors. 
 291   
 292  >>> results = eutils.elink_using_dbids( 
 293  ...             EUtils.DBIds("protein", ["4579714"]), 
 294  ...             db = "protein").read() 
 295  >>> print results[:454] 
 296  <?xml version="1.0"?> 
 297  <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd"> 
 298  <eLinkResult> 
 299  <LinkSet> 
 300          <DbFrom>protein</DbFrom> 
 301          <IdList> 
 302                  <Id>4579714</Id> 
 303          </IdList> 
 304          <LinkSetDb> 
 305                  <DbTo>protein</DbTo> 
 306                  <LinkName>protein_protein</LinkName> 
 307                  <Link> 
 308                          <Id>4579714</Id> 
 309                          <Score>2147483647</Score> 
 310                  </Link> 
 311                  <Link> 
 312                          <Id>11277596</Id> 
 313                          <Score>1279</Score> 
 314                  </Link> 
 315  >>> 
 316   
 317  Let's get all the <Id> fields.  (While the following isn't a good way 
 318  to parse XML, it is easy to understand and works well enough for this 
 319  example.)  Note that I remove the first <Id> because that's from the 
 320  query and not from the results. 
 321   
 322  >>> import re 
 323  >>> ids = re.findall(r"<Id>(\d+)</Id>", results) 
 324  >>> ids = ids[1:] 
 325  >>> len(ids) 
 326  222 
 327  >>> dbids = EUtils.DBIds("protein", ids) 
 328  >>>  
 329   
 330  That's a lot of records.  I could use 'efetch_using_dbids' but there's 
 331  a problem with that.  Efetch uses the HTTP GET protocol to pass 
 332  information to the EUtils server.  ("GET" is what's used when you type 
 333  a URL in the browser.)  Each id takes about 9 characters, so the URL 
  334  would be over 2,000 characters long.  This may not work on some 
  335  systems; for example, some proxies do not support long URLs.  (Search 
 336  for "very long URLs" for examples.) 
 337   
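That back-of-the-envelope figure is easy to check.  A sketch with 222 made-up 8-digit GI numbers (the identifiers here are hypothetical, not the real neighbor list):

```python
# 222 hypothetical 8-digit GI numbers, joined the way an "id" field would be.
ids = ["%08d" % (10000000 + i) for i in range(222)]
id_string = ",".join(ids)

# 222 ids * 8 digits each, plus 221 separating commas.
length = len(id_string)
```

At 1,997 characters before the base URL and the other query fields are even added, the GET request is well past what some proxies tolerate, which is why the upload goes through EPost (a POST request) instead.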
 338  Instead, we'll upload the list to the server then fetch the FASTA 
 339  version using the history. 
 340   
  341  The first step is to upload the data.  EPost always stores the 
  342  uploaded identifiers in the history, so there is no 'usehistory' 
  343  flag here.  There's no existing history yet, so the webenv string 
  344  is None. 
  345   
  346  >>> print eutils.epost(dbids, webenv = None).read() 
 347  <?xml version="1.0"?> 
 348  <!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd"> 
 349  <ePostResult> 
 350          <QueryKey>1</QueryKey> 
 351          <WebEnv>%7BPgTHRHFBsJfC%3C%5C%5C%5B%3EAfJCKQ%5Ey%60%3CGkH%5DH%5E%3DJHGBKAJ%3F%40CbCiG%3FE%3C</WebEnv> 
 352  </ePostResult> 
 353   
 354  >>> 
 355   
 356  This says that the identifiers were saved as query #1, which will be 
 357  used later on as the "query_key" field.  The WebEnv is a cookie (or 
 358  token) used to tell the server where to find that query.  The WebEnv 
 359  changes after every history-enabled ESearch or EPost so you'll need to 
 360  parse the output from those to get the new WebEnv field.  You'll also 
 361  need to unquote it since it is URL-escaped. 
 362   
 363  Also, you will need to pass in the name of the database used for the 
 364  query in order to access the history.  Why?  I don't know -- I figure 
 365  the WebEnv and query_key should be enough to get the database name. 
 366   
 367  >>> import urllib 
 368  >>> webenv = urllib.unquote("%7BPgTHRHFBsJfC%3C%5C%5C%5B%3EAfJCKQ%5Ey%60%3CGkH%5DH%5E%3DJHGBKAJ%3F%40CbCiG%3FE%3C") 
 369  >>> print webenv 
 370  {PgTHRHFBsJfC<\\[>AfJCKQ^y`<GkH]H^=JHGBKAJ?@CbCiG?E< 
 371  >>> 
 372   
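Rather than pasting the escaped string by hand, both fields can be pulled out of the EPost result in one go.  A minimal sketch ('parse_epost' is a hypothetical helper, not part of this module), using the same output shown above:

```python
import re
try:
    from urllib.parse import unquote   # Python 3
except ImportError:
    from urllib import unquote         # Python 2

EPOST_XML = """<ePostResult>
        <QueryKey>1</QueryKey>
        <WebEnv>%7BPgTHRHFBsJfC%3C%5C%5C%5B%3EAfJCKQ%5Ey%60%3CGkH%5DH%5E%3DJHGBKAJ%3F%40CbCiG%3FE%3C</WebEnv>
</ePostResult>"""

def parse_epost(xml_text):
    """Return (query_key, unquoted webenv) from an ePostResult document.

    Hypothetical helper; a regexp is crude but the two fields are flat.
    """
    query_key = re.search(r"<QueryKey>(\d+)</QueryKey>", xml_text).group(1)
    raw = re.search(r"<WebEnv>(\S+)</WebEnv>", xml_text).group(1)
    return query_key, unquote(raw)

query_key, webenv = parse_epost(EPOST_XML)
```

The unquoted webenv matches the string printed above and can be passed straight to the history-based methods.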
 373  Okay, now to get the data in FASTA format.  Notice that I need the 
 374  'retmax' in order to include all the records in the result.  (The 
 375  default is 20 records.) 
 376   
 377  >>> fasta = eutils.efetch_using_history("protein", webenv, query_key = "1", 
 378  ...                                     retmode = "text", rettype = "fasta", 
 379  ...                                     retmax = len(dbids)).read() 
 380  >>> fasta.count(">") 
 381  222 
 382  >>> print fasta[:694] 
 383  >gi|14194475|sp|O93742|BACH_HALSD Halorhodopsin (HR) 
 384  MMETAADALASGTVPLEMTQTQIFEAIQGDTLLASSLWINIALAGLSILLFVYMGRNLEDPRAQLIFVAT 
 385  LMVPLVSISSYTGLVSGLTVSFLEMPAGHALAGQEVLTPWGRYLTWALSTPMILVALGLLAGSNATKLFT 
 386  AVTADIGMCVTGLAAALTTSSYLLRWVWYVISCAFFVVVLYVLLAEWAEDAEVAGTAEIFNTLKLLTVVL 
 387  WLGYPIFWALGAEGLAVLDVAVTSWAYSGMDIVAKYLFAFLLLRWVVDNERTVAGMAAGLGAPLARCAPA 
 388  DD 
 389  >gi|14194474|sp|O93741|BACH_HALS4 Halorhodopsin (HR) 
 390  MRSRTYHDQSVCGPYGSQRTDCDRDTDAGSDTDVHGAQVATQIRTDTLLHSSLWVNIALAGLSILVFLYM 
 391  ARTVRANRARLIVGATLMIPLVSLSSYLGLVTGLTAGPIEMPAAHALAGEDVLSQWGRYLTWTLSTPMIL 
 392  LALGWLAEVDTADLFVVIAADIGMCLTGLAAALTTSSYAFRWAFYLVSTAFFVVVLYALLAKWPTNAEAA 
 393  GTGDIFGTLRWLTVILWLGYPILWALGVEGFALVDSVGLTSWGYSLLDIGAKYLFAALLLRWVANNERTI 
 394  AVGQRSGRGAIGDPVED 
 395  >>>  
 396   
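Before feeding the records to an alignment program you may want them as (title, sequence) pairs rather than raw text.  A simple sketch ('split_fasta' is a hypothetical helper, not part of this module), shown on a two-record excerpt of the output above:

```python
FASTA = """>gi|14194475|sp|O93742|BACH_HALSD Halorhodopsin (HR)
MMETAADALASGTVPLEMTQTQIFEAIQGDTLLASSLWINIALAGLSILLFVYMGRNLEDPRAQLIFVAT
LMVPLVSISSYTGLVSGLTVSFLEMPAGHALAGQEVLTPWGRYLTWALSTPMILVALGLLAGSNATKLFT
>gi|14194474|sp|O93741|BACH_HALS4 Halorhodopsin (HR)
MRSRTYHDQSVCGPYGSQRTDCDRDTDAGSDTDVHGAQVATQIRTDTLLHSSLWVNIALAGLSILVFLYM
ARTVRANRARLIVGATLMIPLVSLSSYLGLVTGLTAGPIEMPAAHALAGEDVLSQWGRYLTWTLSTPMIL
"""

def split_fasta(text):
    """Split FASTA text into (title, sequence) pairs.

    Hypothetical helper for illustration; assumes well-formed input where
    every record starts with a '>' title line.
    """
    records = []
    title, seq_lines = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(seq_lines)))
            title, seq_lines = line[1:], []
        elif line.strip():
            seq_lines.append(line.strip())
    if title is not None:
        records.append((title, "".join(seq_lines)))
    return records

records = split_fasta(FASTA)
```

Applied to the full download, len(records) should equal the 222 counted with fasta.count(">").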
 397  To round things out, here's a query which refines the previous query. 
 398  I want to get all records from the first search which also have the 
 399  word "Structure" in them.  (My background was originally structural 
 400  biophysics, whaddya expect?  :) 
 401   
  402  >>> print eutils.esearch("#1 AND structure", db = "protein", usehistory = 1, 
 403  ...                     webenv = webenv).read() 
 404  <?xml version="1.0"?> 
 405  <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd"> 
 406  <eSearchResult> 
 407          <Count>67</Count> 
 408          <RetMax>20</RetMax> 
 409          <RetStart>0</RetStart> 
 410          <QueryKey>2</QueryKey> 
 411          <WebEnv>UdvMf%3F%60G%3DIE%60bG%3DGec%3E%3D%3Cbc_%5DgBAf%3EAi_e%5EAJcHgDi%3CIqGdE%7BmC%3C</WebEnv> 
 412          <IdList> 
 413                  <Id>461608</Id> 
 414                  <Id>114808</Id> 
 415                  <Id>1364150</Id> 
 416                  <Id>1363466</Id> 
 417                  <Id>1083906</Id> 
 418                  <Id>99232</Id> 
 419                  <Id>99212</Id> 
 420                  <Id>81076</Id> 
 421                  <Id>114811</Id> 
 422                  <Id>24158915</Id> 
 423                  <Id>24158914</Id> 
 424                  <Id>24158913</Id> 
 425                  <Id>1168615</Id> 
 426                  <Id>114812</Id> 
 427                  <Id>114809</Id> 
 428                  <Id>17942995</Id> 
 429                  <Id>17942994</Id> 
 430                  <Id>17942993</Id> 
 431                  <Id>20151159</Id> 
 432                  <Id>20150922</Id> 
 433          </IdList> 
 434          <TranslationSet> 
 435          </TranslationSet> 
 436          <TranslationStack> 
 437                  <TermSet> 
 438                          <Term>#1</Term> 
 439                          <Field>All Fields</Field> 
 440                          <Count>222</Count> 
 441                          <Explode>Y</Explode> 
 442                  </TermSet> 
 443                  <TermSet> 
 444                          <Term>structure[All Fields]</Term> 
 445                          <Field>All Fields</Field> 
 446                          <Count>142002</Count> 
 447                          <Explode>Y</Explode> 
 448                  </TermSet> 
 449                  <OP>AND</OP> 
 450          </TranslationStack> 
 451  </eSearchResult> 
 452   
 453  >>>  
 454   
 455  One last thing about history.  It doesn't last very long -- perhaps an 
 456  hour or so.  (Untested.)  You may be able to toss it some keep-alive 
  457  signal every once in a while.  Or you may want to keep a local copy of the identifiers so you can rebuild the history if it expires. 
 458   
 459  The known 'db' fields and primary IDs (if known) are 
 460    genome -- GI number 
 461    nucleotide -- GI number 
 462    omim  -- MIM number 
 463    popset -- GI number 
 464    protein -- GI number 
 465    pubmed  -- PMID 
 466    sequences (not available; this will combine all sequence databases) 
 467    structure -- MMDB ID 
 468    taxonomy -- TAXID 
 469   
 470  The 'field' parameter is different for different databases.  The 
 471  fields for PubMed are listed at 
 472   
 473  http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#SearchFieldDescriptionsandTags 
 474   
 475    Affiliation -- AD 
 476    All Fields -- All 
 477    Author -- AU 
 478    EC/RN Number -- RN 
 479    Entrez Date -- EDAT  (also valid for 'datetype') 
 480    Filter -- FILTER 
 481    Issue -- IP 
 482    Journal Title -- TA 
 483    Language -- LA 
 484    MeSH Date -- MHDA  (also valid for 'datetype') 
 485    MeSH Major Topic -- MAJR 
 486    MeSH Subheadings -- SH 
 487    MeSH Terms -- MH 
 488    Pagination -- PG 
 489    Personal Name as Subject -- PS 
 490    Publication Date -- DP  (also valid for 'datetype') 
 491    Publication Type -- PT 
 492    Secondary Source ID -- SI 
 493    Subset -- SB 
 494    Substance Name -- NM 
 495    Text Words -- TW 
 496    Title -- TI 
 497    Title/Abstract -- TIAB 
 498    Unique Identifiers -- UID 
 499    Volume -- VI 
 500   
 501  The fields marked as 'datetype' can also be used for date searches. 
  502  Date searches can be done in the query, for example as 
  503   
  504     1990/01/01:1999/12/31[edat] 
  505   
  506  or by passing a WithinNDays or DateRange object to the 'daterange' 
  507  parameter of the search. 
 508   
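The WithinNDays(60, "edat") restriction used in the first example boils down to exactly such a range term.  A sketch of the equivalent computation ('within_n_days_term' is a hypothetical helper, not the EUtils class):

```python
import datetime

def within_n_days_term(n, field="edat", today=None):
    """Build a 'last N days' range term like WithinNDays produces.

    Hypothetical helper for illustration; 'today' is overridable so the
    result is reproducible.
    """
    today = today or datetime.date.today()
    start = today - datetime.timedelta(days=n)
    fmt = "%Y/%m/%d"
    return "%s:%s[%s]" % (start.strftime(fmt), today.strftime(fmt), field)

# Pinning 'today' to the date of the example search reproduces the
# 2002/10/30 .. 2002/12/29 range seen in the TranslationStack output.
term = within_n_days_term(60, "edat", datetime.date(2002, 12, 29))
```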
 509   
  510  Please pay attention to the usage limits!  They are listed at 
 511    http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html 
 512   
 513  At the time of this writing they are: 
 514      * Run retrieval scripts on weekends or between 9 PM and 5 AM ET 
 515              weekdays for any series of more than 100 requests. 
 516      * Make no more than one request every 3 seconds. 
 517      * Only 5000 PubMed records may be retrieved in a single day. 
 518   
 519      * NCBI's Disclaimer and Copyright notice must be evident to users 
 520        of your service.  NLM does not hold the copyright on the PubMed 
  521        abstracts; the journal publishers do.  NLM provides no legal 
  522        advice concerning distribution of copyrighted materials; consult 
 523        your legal counsel. 
 524   
 525  (Their disclaimer is at 
 526         http://www.ncbi.nlm.nih.gov/About/disclaimer.html ) 
 527   
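The "one request every 3 seconds" rule is easy to honor with a small throttle around the client.  A minimal sketch, not part of this module (the 'RateLimiter' class is hypothetical):

```python
import time

class RateLimiter:
    """Block so that successive requests are at least min_interval apart.

    Hypothetical helper; call wait() immediately before each EUtils request.
    """
    def __init__(self, min_interval=3.0):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.time()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.time()
```

Usage would look like: limiter = RateLimiter(); limiter.wait(); eutils.esearch(...), repeated for each call.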
 528   
 529  """ # "  # Emacs cruft 
 530   
 531  import urllib, urllib2, cStringIO 
 532   
 533  DUMP_URL = 0 
 534  DUMP_RESULT = 0 
 535   
 536  # These tell NCBI who is using the tool.  They are meant to provide 
 537  # hints to NCBI about how their service is being used and provide a 
 538  # means of getting ahold of the author. 
 539  # 
 540  # To use your own values, pass them in to the EUtils constructor. 
 541  # 
 542  TOOL = "EUtils_Python_client" 
 543  EMAIL = "biopython-dev@biopython.org" 
 544   
 545  assert " " not in TOOL 
 546  assert " " not in EMAIL 
 547   
 548  def _dbids_to_id_string(dbids): 
 549      """Internal function: convert a list of ids to a comma-separated string""" 
 550      # NOTE: the server strips out non-numeric characters 
 551      # Eg, "-1" is treated as "1".  So do some sanity checking. 
 552      # XXX Should I check for non-digits? 
 553      #     Are any of the IDs non-integers? 
 554      if not dbids: 
 555          raise TypeError("dbids list must have at least one term") 
 556      for x in dbids.ids: 
 557          if "," in x: 
 558              raise TypeError("identifiers cannot contain a comma: %r " % 
 559                              (x,)) 
 560      id_string = ",".join(dbids.ids) 
 561      assert id_string.count(",") == len(dbids.ids)-1, "double checking" 
 562      return id_string 
563
 564  class ThinClient: 
 565      """Client-side interface to the EUtils services 
 566   
 567      See the module docstring for much more complete information. 
 568      """ 
 569      def __init__(self, 
 570                   opener = None, 
 571                   tool = TOOL, 
 572                   email = EMAIL, 
 573                   baseurl = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"): 
 574          """opener = None, tool = TOOL, email = EMAIL, baseurl = ".../eutils/" 
 575   
 576          'opener' -- an object which implements the 'open' method like a 
 577              urllib2.OpenerDirector.  Defaults to urllib2.build_opener() 
 578   
 579          'tool' -- the term to use for the 'tool' field, used by NCBI to 
 580              track which programs use their services.  If you write your 
 581              own tool based on this package, use your own tool name. 
 582   
 583          'email' -- a way for NCBI to contact you (the developer, not 
 584              the user!) if there are problems and to tell you about 
 585              updates or changes to their system. 
 586   
 587          'baseurl' -- location of NCBI's EUtils directory.  Shouldn't need 
 588              to change this at all. 
 589          """ 
 590   
 591          if tool is not None and " " in tool: 
 592              raise TypeError("No spaces allowed in 'tool'") 
 593          if email is not None and " " in email: 
 594              raise TypeError("No spaces allowed in 'email'") 
 595   
 596          if opener is None: 
 597              opener = urllib2.build_opener() 
 598   
 599          self.opener = opener 
 600          self.tool = tool 
 601          self.email = email 
 602          self.baseurl = baseurl 
603
 604      def _fixup_query(self, query): 
 605          """Internal function to add and remove fields from a query""" 
 606          q = query.copy() 
 607   
 608          # Set the 'tool' and 'email' fields 
 609          q["tool"] = self.tool 
 610          q["email"] = self.email 
 611   
 612          # Kinda cheesy -- shouldn't really do this here. 
 613          # If 'usehistory' is true, use the value of 'Y' instead. 
 614          # Otherwise, don't use history 
 615          if "usehistory" in q: 
 616              if q["usehistory"]: 
 617                  q["usehistory"] = "y" 
 618              else: 
 619                  q["usehistory"] = None 
 620   
 621          # This will also remove the history, email, etc. fields 
 622          # if they are set to None. 
 623          for k, v in q.items(): 
 624              if v is None: 
 625                  del q[k] 
 626   
 627          # Convert the query into the form needed for a GET. 
 628          return urllib.urlencode(q) 
629
 630      def _get(self, program, query): 
 631          """Internal function: send the query string to the program as GET""" 
 632          # NOTE: epost uses a different interface 
 633   
 634          q = self._fixup_query(query) 
 635          url = self.baseurl + program + "?" + q 
 636          if DUMP_URL: 
 637              print "Opening with GET:", url 
 638          if DUMP_RESULT: 
 639              print " ================== Results ============= " 
 640              s = self.opener.open(url).read() 
 641              print s 
 642              print " ================== Finished ============ " 
 643              return cStringIO.StringIO(s) 
 644          return self.opener.open(url) 
645
 646      def esearch(self, 
 647                  term,              # In Entrez query language 
 648                  db = "pubmed",     # Required field, default to PubMed 
 649                  field = None,      # Field to use for unqualified words 
 650                  daterange = None,  # Date restriction 
 651   
 652                  retstart = 0, 
 653                  retmax = 20,       # Default from NCBI is 20, so I'll use that 
 654   
 655                  usehistory = 0,    # Enable history tracking 
 656                  webenv = None,     # If given, add to an existing history 
 657                  ): 
 658   
 659          """term, db="pubmed", field=None, daterange=None, retstart=0, retmax=20, usehistory=0, webenv=None 
 660   
 661          Search the given database for records matching the query given 
 662          in the 'term'.  See the module docstring for examples. 
 663   
 664          'term' -- the query string in the Entrez query language; see 
 665              http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html 
 666          'db' -- the database to search 
 667   
 668          'field' -- the field to use for unqualified words 
 669              Eg, "dalke[au] AND gene" with field==None becomes 
 670                  dalke[au] AND (genes[MeSH Terms] OR gene[Text Word]) 
 671              and "dalke[au] AND gene" with field=="au" becomes 
 672                  dalke[au] AND genes[Author] 
 673              (Yes, I think the first "au" should be "Author" too) 
 674   
 675          'daterange' -- a date restriction; either WithinNDays or DateRange 
 676          'retstart' -- include identifiers in the output, starting with 
 677              position 'retstart' (normally starts with 0) 
 678          'retmax' -- return at most 'retmax' identifiers in the output 
 679              (if not specified, NCBI returns 20 identifiers) 
 680   
 681          'usehistory' -- flag to enable history tracking 
 682          'webenv' -- if this string is given, add the search results 
 683              to an existing history.  (WARNING: the history disappears 
 684              after about an hour of non-use.) 
 685   
 686          You will need to parse the output XML to get the new QueryKey 
 687          and WebEnv fields. 
 688   
 689          Returns an input stream from an HTTP request.  The stream 
 690          contents are in XML. 
 691          """ 
 692          query = {"term": term, 
 693                   "db": db, 
 694                   "field": field, 
 695                   "retstart": retstart, 
 696                   "retmax": retmax, 
 697                   "usehistory": usehistory, 
 698                   "WebEnv": webenv, 
 699                   } 
 700          if daterange is not None: 
 701              query.update(daterange.get_query_params()) 
 702   
 703          return self._get(program = "esearch.fcgi", query = query) 
704
 705      def epost(self, 
 706                dbids, 
 707   
 708                webenv = None,  # If given, add to an existing history 
 709                ): 
 710          """dbids, webenv = None 
 711   
 712          Create a new collection in the history containing the given 
 713          list of identifiers for a database. 
 714   
 715          'dbids' -- a DBIds, which contains the database name and 
 716              a list of identifiers in that database 
 717          'webenv' -- if this string is given, add the collection 
 718              to an existing history.  (WARNING: the history disappears 
 719              after about an hour of non-use.) 
 720   
 721          You will need to parse the output XML to get the new QueryKey 
 722          and WebEnv fields.  NOTE: The order of the IDs on the server 
 723          is NOT NECESSARILY the same as the upload order. 
 724   
 725          Returns an input stream from an HTTP request.  The stream 
 726          contents are in XML. 
 727          """ 
 728          id_string = _dbids_to_id_string(dbids) 
 729   
 730          # Looks like it will accept *any* ids.  Wonder what that means. 
 731          program = "epost.fcgi" 
 732          query = {"id": id_string, 
 733                   "db": dbids.db, 
 734                   "WebEnv": webenv, 
 735                   } 
 736          q = self._fixup_query(query) 
 737   
 738          # Need to use a POST since the data set can be *very* long; 
 739          # even too long for GET. 
 740          if DUMP_URL: 
 741              print "Opening with POST:", self.baseurl + program + "?" + q 
 742          if DUMP_RESULT: 
 743              print " ================== Results ============= " 
 744              s = self.opener.open(self.baseurl + program, q).read() 
 745              print s 
 746              print " ================== Finished ============ " 
 747              return cStringIO.StringIO(s) 
 748          return self.opener.open(self.baseurl + program, q) 
749
 750      def esummary_using_history(self, 
 751                                 db,  # This is required.  Don't use a 
 752                                      # default here because it must match 
 753                                      # that of the webenv 
 754                                 webenv, 
 755                                 query_key, 
 756                                 retstart = 0, 
 757                                 retmax = 20, 
 758                                 retmode = "xml",  # any other modes? 
 759                                 ): 
 760          """db, webenv, query_key, retstart = 0, retmax = 20, retmode = "xml" 
 761   
 762          Get the summary for a collection of records in the history 
 763   
 764          'db' -- the database containing the history/collection 
 765          'webenv' -- the WebEnv cookie for the history 
 766          'query_key' -- the collection in the history 
 767          'retstart' -- get the summaries starting with this position 
 768          'retmax' -- get at most this many summaries 
 769          'retmode' -- can only be 'xml'.  (Are there others?) 
 770   
 771          Returns an input stream from an HTTP request.  The stream 
 772          contents are in 'retmode' format. 
 773          """ 
 774          return self._get(program = "esummary.fcgi", 
 775                           query = {"db": db, 
 776                                    "WebEnv": webenv, 
 777                                    "query_key": query_key, 
 778                                    "retstart": retstart, 
 779                                    "retmax": retmax, 
 780                                    "retmode": retmode, 
 781                                    }) 
782
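Since 'retstart'/'retmax' select one window of the collection per call, fetching all summaries means stepping through windows. The helper `batch_windows` is hypothetical (not part of the module); it just shows the arithmetic a caller would use to page through a history collection.

```python
def batch_windows(total, retmax):
    # Yield (retstart, retmax) pairs covering `total` records in order.
    for start in range(0, total, retmax):
        yield start, min(retmax, total - start)

# A 45-record collection fetched 20 summaries at a time:
windows = list(batch_windows(45, 20))
```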
    def esummary_using_dbids(self,
                             dbids,
                             retmode = "xml",  # any other modes?
                             ):
        """dbids, retmode = "xml"

        Get the summary for records specified by identifier.

        'dbids' -- a DBIds containing the database name and list
            of record identifiers
        'retmode' -- can only be 'xml'

        Returns an input stream from an HTTP request.  The stream
        contents are in 'retmode' format.
        """
        id_string = _dbids_to_id_string(dbids)
        return self._get(program = "esummary.fcgi",
                         query = {"id": id_string,
                                  "db": dbids.db,
                                  # "retmax": len(dbids.ids),  # needed?
                                  "retmode": retmode,
                                  })

    def efetch_using_history(self,
                             db,
                             webenv,
                             query_key,

                             retstart = 0,
                             retmax = 20,

                             retmode = None,
                             rettype = None,

                             # sequence only
                             seq_start = None,
                             seq_stop = None,
                             strand = None,
                             complexity = None,
                             ):
        """db, webenv, query_key, retstart=0, retmax=20, retmode=None, rettype=None, seq_start=None, seq_stop=None, strand=None, complexity=None

        Fetch information for a collection of records in the history,
        in a variety of formats.

        'db' -- the database containing the history/collection
        'webenv' -- the WebEnv cookie for the history
        'query_key' -- the collection in the history
        'retstart' -- get the formatted data starting with this position
        'retmax' -- get data for at most this many records

        These options work for sequence databases:

        'seq_start' -- return the sequence starting at this position.
            The first position is numbered 1.
        'seq_stop' -- return the sequence ending at this position.
            Includes the stop position, so seq_start = 1 and
            seq_stop = 5 returns the first 5 bases/residues.
        'strand' -- strand.  Use EUtils.PLUS_STRAND (== 1) for the plus
            strand and EUtils.MINUS_STRAND (== 2) for the minus strand.
        'complexity' -- regulates the level of display.  Options are
            0 - get the whole blob
            1 - get the bioseq for the gi of interest (default in Entrez)
            2 - get the minimal bioseq-set containing the gi of interest
            3 - get the minimal nuc-prot containing the gi of interest
            4 - get the minimal pub-set containing the gi of interest

        http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html

        The valid retmode and rettype values are:

        For publication databases (omim, pubmed, journals) the
        retmodes are 'xml', 'asn.1', 'text', and 'html'.

            If retmode == 'xml'   ---> XML (default)
            If retmode == 'asn.1' ---> ASN.1

        The following rettype values work for retmode == 'text':

            docsum   ----> author / title / cite / PMID
            brief    ----> a one-liner up to about 66 chars
            abstract ----> cite / title / author / dept /
                           full abstract / PMID
            citation ----> cite / title / author / dept /
                           full abstract / MeSH terms /
                           substances / PMID
            medline  ----> full record in MEDLINE format
            asn.1    ----> full record in one ASN.1 format
            mlasn1   ----> full record in another ASN.1 format
            uilist   ----> list of uids, one per line
            sgml     ----> same as retmode="xml"

        Sequence databases (genome, protein, nucleotide, popset)
        also have retmode values of 'xml', 'asn.1', 'text', and
        'html'.

            If retmode == 'xml'   ---> XML (default; only supports
                                       rettype == 'native')
            If retmode == 'asn.1' ---> ASN.1 text (only works for rettype
                                       of 'native' and 'sequin')

        The following work with a retmode of 'text' or 'html':

            native      ----> default format for viewing sequences
            fasta       ----> FASTA view of a sequence
            gb          ----> GenBank view for sequences; constructed
                              sequences will be shown as contigs (by
                              pointing to their parts).  Valid for
                              nucleotides.
            gbwithparts ----> GenBank view for sequences; the sequence
                              will always be shown.  Valid for
                              nucleotides.
            est         ----> EST report.  Valid for sequences from the
                              dbEST database.
            gss         ----> GSS report.  Valid for sequences from the
                              dbGSS database.
            gp          ----> GenPept view.  Valid for proteins.
            seqid       ----> convert a list of gis into a list of seqids
            acc         ----> convert a list of gis into a list of
                              accessions

            # XXX TRY THESE
            fasta_xml
            gb_xml
            gi (same as uilist?)

        A retmode of 'file' is the same as 'text' except the data is
        sent with a Content-Type of application/octet-stream, which tells
        the browser to save the data to a file.

        A retmode of 'html' is the same as 'text' except an HTML header
        and footer are added and special characters are properly escaped.

        Returns an input stream from an HTTP request.  The stream
        contents are in the requested format.
        """

        # NOTE: found the list of possible values by sending illegal
        # parameters, to see which comes up as an error message.  Used
        # that to supplement the information from the documentation.
        # Looks like efetch is based on pmfetch code and uses the same
        # types.

        # If retmax is specified and larger than 500, NCBI only returns
        # 500 sequences; removing retmax from the URL relieves this
        # constraint.  So if retstart is 0 and retmax is greater than
        # 500, we set retmax to None.
        if retstart == 0 and retmax > 500:
            retmax = None
        return self._get(program = "efetch.fcgi",
                         query = {"db": db,
                                  "WebEnv": webenv,
                                  "query_key": query_key,
                                  "retstart": retstart,
                                  "retmax": retmax,
                                  "retmode": retmode,
                                  "rettype": rettype,
                                  "seq_start": seq_start,
                                  "seq_stop": seq_stop,
                                  "strand": strand,
                                  "complexity": complexity,
                                  })

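The retmax workaround in efetch_using_history can be isolated as a small pure function. The name `effective_retmax` is hypothetical; the sketch mirrors the condition used in the method: when fetching from the start of a collection with a retmax above NCBI's silent 500-record cap, the parameter is dropped (set to None, so it is omitted from the URL).

```python
def effective_retmax(retstart, retmax):
    # NCBI caps explicit retmax values at 500; omitting the parameter
    # (None means "leave it out of the URL") lifts the cap.
    if retstart == 0 and retmax > 500:
        return None
    return retmax
```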
    def efetch_using_dbids(self,
                           dbids,
                           retmode = None,
                           rettype = None,

                           # sequence only
                           seq_start = None,
                           seq_stop = None,
                           strand = None,
                           complexity = None,
                           ):
        """dbids, retmode = None, rettype = None, seq_start = None, seq_stop = None, strand = None, complexity = None

        Fetch information for records specified by identifier.

        'dbids' -- a DBIds containing the database name and list
            of record identifiers
        'retmode' -- see the docstring for 'efetch_using_history'
        'rettype' -- see the docstring for 'efetch_using_history'

        These options work for sequence databases:

        'seq_start' -- return the sequence starting at this position.
            The first position is numbered 1.
        'seq_stop' -- return the sequence ending at this position.
            Includes the stop position, so seq_start = 1 and
            seq_stop = 5 returns the first 5 bases/residues.
        'strand' -- strand.  Use EUtils.PLUS_STRAND (== 1) for the plus
            strand and EUtils.MINUS_STRAND (== 2) for the minus strand.
        'complexity' -- regulates the level of display.  Options are
            0 - get the whole blob
            1 - get the bioseq for the gi of interest (default in Entrez)
            2 - get the minimal bioseq-set containing the gi of interest
            3 - get the minimal nuc-prot containing the gi of interest
            4 - get the minimal pub-set containing the gi of interest

        Returns an input stream from an HTTP request.  The stream
        contents are in the requested format.
        """
        id_string = _dbids_to_id_string(dbids)
        return self._get(program = "efetch.fcgi",
                         query = {"id": id_string,
                                  "db": dbids.db,
                                  # "retmax": len(dbids.ids),  # needed?
                                  "retmode": retmode,
                                  "rettype": rettype,
                                  "seq_start": seq_start,
                                  "seq_stop": seq_stop,
                                  "strand": strand,
                                  "complexity": complexity,
                                  })
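The 1-based, inclusive 'seq_start'/'seq_stop' convention differs from Python's 0-based, exclusive slicing, which is an easy source of off-by-one errors when checking a fetched fragment against a local sequence. The helper `to_python_slice` is hypothetical, purely to illustrate the conversion.

```python
def to_python_slice(seq_start, seq_stop):
    # EFetch: 1-based, stop inclusive.  Python: 0-based, stop exclusive.
    # seq_start = 1, seq_stop = 5 selects the first five bases/residues.
    return seq_start - 1, seq_stop

s = "ACGTACGTAC"
start, stop = to_python_slice(1, 5)
fragment = s[start:stop]   # the first 5 bases
```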