1 """Low-level interface to NCBI's EUtils for Entrez search and retrieval.
2
3 For higher-level interfaces, see DBIdsClient (which works with a set
4 of database identifiers) and HistoryClient (which does a much better
5 job of handling history).
6
7 There are five classes of services:
8 ESearch - search a database
9 EPost - upload a list of indicies for further use
10 ESummary - get document summaries for a given set of records
11 EFetch - get the records translated to a given format
12 ELink - find related records in other databases
13
14 You can find more information about them at
15 http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
16 but that document isn't very useful. Perhaps the following is better.
17
18 EUtils offers a structured way to query Entrez, get the results in
19 various formats, and get information about related documents. The way
20 to start off is create an EUtils object.
21
22 >>> from Bio import EUtils
23 >>> from Bio.EUtils.ThinClient import ThinClient
24 >>> eutils = ThinClient.ThinClient()
25 >>>
26
27 You can search Entrez with the "esearch" method. This does a query on
28 the server, which generates a list of identifiers for records that
29 matched the query. However, not all the identifiers are returned.
30 You can request only a subset of the matches (using the 'retstart' and
31 'retmax') terms. This is useful because searches like 'cancer' can
32 have over 1.4 million matches. Most people would rather change the
33 query or look at more details about the first few hits than wait to
34 download all the identifiers before doing anything else.
35
36 The esearch method, and indeed all these methods, returns a
37 'urllib.addinfourl' which is an HTTP socket connection that has
38 already parsed the HTTP header and is ready to read the data from the
39 server.
40
41 For example, here's a query and how to use it
42
43 Search in PubMed for the term cancer for the entrez date from the
44 last 60 days and retrieve the first 10 IDs and translations using
45 the history parameter.
46
47 >>> infile = eutils.esearch("cancer",
48 ... daterange = EUtils.WithinNDays(60, "edat"),
49 ... retmax = 10)
50 >>>
51 >>> print infile.read()
52 <?xml version="1.0"?>
53 <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
54 <eSearchResult>
55 <Count>7228</Count>
56 <RetMax>10</RetMax>
57 <RetStart>0</RetStart>
58 <IdList>
59 <Id>12503096</Id>
60 <Id>12503075</Id>
61 <Id>12503073</Id>
62 <Id>12503033</Id>
63 <Id>12503030</Id>
64 <Id>12503028</Id>
65 <Id>12502932</Id>
66 <Id>12502925</Id>
67 <Id>12502881</Id>
68 <Id>12502872</Id>
69 </IdList>
70 <TranslationSet>
71 <Translation>
72 <From>cancer%5BAll+Fields%5D</From>
73 <To>(%22neoplasms%22%5BMeSH+Terms%5D+OR+cancer%5BText+Word%5D)</To>
74 </Translation>
75 </TranslationSet>
76 <TranslationStack>
77 <TermSet>
78 <Term>"neoplasms"[MeSH Terms]</Term>
79 <Field>MeSH Terms</Field>
80 <Count>1407151</Count>
81 <Explode>Y</Explode>
82 </TermSet>
83 <TermSet>
84 <Term>cancer[Text Word]</Term>
85 <Field>Text Word</Field>
86 <Count>382919</Count>
87 <Explode>Y</Explode>
88 </TermSet>
89 <OP>OR</OP>
90 <TermSet>
91 <Term>2002/10/30[edat]</Term>
92 <Field>edat</Field>
93 <Count>-1</Count>
94 <Explode>Y</Explode>
95 </TermSet>
96 <TermSet>
97 <Term>2002/12/29[edat]</Term>
98 <Field>edat</Field>
99 <Count>-1</Count>
100 <Explode>Y</Explode>
101 </TermSet>
102 <OP>RANGE</OP>
103 <OP>AND</OP>
104 </TranslationStack>
105 </eSearchResult>
106
107 >>>
108
109 You get a raw XML input stream which you can process in many ways.
110 (The appropriate DTDs are included in the subdirectory "DTDs" and see
111 also the included POM reading code.)
112
113 WARNING! As of this writing (2002/12/3) NCBI returns their
114 XML encoded as Latin-1 but their processing instruction says
115 it is UTF-8 because they leave out the "encoding" attribute.
116 Until they fix it you will need to recode the input stream
117 before processing it with XML tools, like this
118
119 import codecs
120 infile = codecs.EncodedFile(infile, "utf-8", "iso-8859-1")
121
122
123 The XML fields are mostly understandable:
124 Count -- the total number of matches from this search
125 RetMax -- the number of <ID> values returned in this subset
126 RetStart -- the start position of this subset in the list of
127 all matches
128
129 IDList and ID -- the identifiers in this subset
130
131 TranslationSet / Translation -- if the search field is not
132 explicitly specified ("qualified"), then the server will
133 apply a set of hueristics to improve the query. Eg, in
134 this case "cancer" is first parsed as
135 cancer[All Fields]
136 then turned into the query
137 "neoplasms"[MeSH Terms] OR cancer[Text Word]
138
139 Note that these terms are URL escaped.
140 For details on how the translation is done, see
141 http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#AutomaticTermMapping
142
143 TranslationStack -- The (possibly 'improved' query) fully
144 parsed out and converted into a postfix (RPN) notation.
145 The above example is written in the Entrez query language as
146
147 ("neoplasms"[MeSH Terms] OR cancer[Text Word]) AND
148 2002/10/30:2002/12/29[edat]
149 Note that these terms are *not* URL escaped. Nothing like
150 a bit of inconsistency for the soul.
151
152 The "Count" field shows how many matches were found for each
153 term of the expression. I don't know what "Explode" does.
154
155
156 Let's get more information about the first record, which has an id of
157 12503096. There are two ways to query for information, one uses a set
158 of identifiers and the other uses the history. I'll talk about the
159 history one in a bit. To use a set of identifiers you need to make a
160 DBIds object containing the that list.
161
162 >>> dbids = EUtils.DBIds("pubmed", ["12503096"])
163 >>>
164
165 Now get the summary using dbids
166
167 >>> infile = eutils.esummary_using_dbids(dbids)
168 >>> print infile.read()
169 <?xml version="1.0"?>
170 <!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD eSummaryResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_020511.dtd">
171 <eSummaryResult>
172 <DocSum>
173 <Id>12503096</Id>
174 <Item Name="PubDate" Type="Date">2003 Jan 30</Item>
175 <Item Name="Source" Type="String">Am J Med Genet</Item>
176 <Item Name="Authors" Type="String">Coyne JC, Kruus L, Racioppo M, Calzone KA, Armstrong K</Item>
177 <Item Name="Title" Type="String">What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?</Item>
178 <Item Name="Volume" Type="String">116</Item>
179 <Item Name="Pages" Type="String">222-8</Item>
180 <Item Name="EntrezDate" Type="Date">2002/12/28 04:00</Item>
181 <Item Name="PubMedId" Type="Integer">12503096</Item>
182 <Item Name="MedlineId" Type="Integer">22390532</Item>
183 <Item Name="Lang" Type="String">English</Item>
184 <Item Name="PubType" Type="String"></Item>
185 <Item Name="RecordStatus" Type="String">PubMed - in process</Item>
186 <Item Name="Issue" Type="String">3</Item>
187 <Item Name="SO" Type="String">2003 Jan 30;116(3):222-8</Item>
188 <Item Name="DOI" Type="String">10.1002/ajmg.a.10844</Item>
189 <Item Name="JTA" Type="String">3L4</Item>
190 <Item Name="ISSN" Type="String">0148-7299</Item>
191 <Item Name="PubId" Type="String"></Item>
192 <Item Name="PubStatus" Type="Integer">4</Item>
193 <Item Name="Status" Type="Integer">5</Item>
194 <Item Name="HasAbstract" Type="Integer">1</Item>
195 <Item Name="ArticleIds" Type="List">
196 <Item Name="PubMedId" Type="String">12503096</Item>
197 <Item Name="DOI" Type="String">10.1002/ajmg.a.10844</Item>
198 <Item Name="MedlineUID" Type="String">22390532</Item>
199 </Item>
200 </DocSum>
201 </eSummaryResult>
202 >>>
203
204 This is just a summary. To get the full details, including an
205 abstract (if available) use the 'efetch' method. I'll only print a
206 bit to convince you it has an abstract.
207
208 >>> s = eutils.efetch_using_dbids(dbids).read()
209 >>> print s[587:860]
210 <ArticleTitle>What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?</ArticleTitle>
211 <Pagination>
212 <MedlinePgn>222-8</MedlinePgn>
213 </Pagination>
214 <Abstract>
215 <AbstractText>Women recruited from a hereditary cancer registry provided
216 >>>
217
218 Suppose instead you want the data in a text format. Different
219 databases have different text formats. For example, PubMed has a
220 "docsum" format which gives just the summary of a document and
221 "medline" format as needed for a citation database. To get these, use
222 a "text" "retmode" ("return mode") and select the appropriate
223 "rettype" ("return type").
224
225 Here are examples of those two return types
226
227 >>> print eutils.efetch_using_dbids(dbids, "text", "docsum").read()[:497]
228 1: Coyne JC, Kruus L, Racioppo M, Calzone KA, Armstrong K.
229 What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?
230 Am J Med Genet. 2003 Jan 30;116(3):222-8.
231 PMID: 12503096 [PubMed - in process]
232 >>> print eutils.efetch_using_dbids(dbids, "text", "medline").read()[:369]
233 UI - 22390532
234 PMID- 12503096
235 DA - 20021227
236 IS - 0148-7299
237 VI - 116
238 IP - 3
239 DP - 2003 Jan 30
240 TI - What do ratings of cancer-specific distress mean among women at high risk
241 of breast and ovarian cancer?
242 PG - 222-8
243 AB - Women recruited from a hereditary cancer registry provided ratings of
244 distress associated with different aspects of high-risk status
245 >>>
246
247 It's also possible to get a list of records related to a given
248 article. This is done through the "elink" method. For example,
249 here's how to get the list of PubMed articles related to the above
250 PubMed record. (Again, truncated because otherwise there is a lot of
251 data.)
252
253 >>> print eutils.elink_using_dbids(dbids).read()[:590]
254 <?xml version="1.0"?>
255 <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
256 <eLinkResult>
257 <LinkSet>
258 <DbFrom>pubmed</DbFrom>
259 <IdList>
260 <Id>12503096</Id>
261 </IdList>
262 <LinkSetDb>
263 <DbTo>pubmed</DbTo>
264 <LinkName>pubmed_pubmed</LinkName>
265 <Link>
266 <Id>12503096</Id>
267 <Score>2147483647</Score>
268 </Link>
269 <Link>
270 <Id>11536413</Id>
271 <Score>30817790</Score>
272 </Link>
273 <Link>
274 <Id>11340606</Id>
275 <Score>29939219</Score>
276 </Link>
277 <Link>
278 <Id>10805955</Id>
279 <Score>29584451</Score>
280 </Link>
281 >>>
282
283 For a change of pace, let's work with the protein database to learn
284 how to work with history. Suppose I want to do a multiple sequene
285 alignment of bacteriorhodopsin with all of its neighbors, where
286 "neighbors" is defined by NCBI. There are good programs for this -- I
287 just need to get the records in the right format, like FASTA.
288
289 The bacteriorhodopsin I'm interested in is BAA75200, which is
290 GI:4579714, so I'll start by asking for its neighbors.
291
292 >>> results = eutils.elink_using_dbids(
293 ... EUtils.DBIds("protein", ["4579714"]),
294 ... db = "protein").read()
295 >>> print results[:454]
296 <?xml version="1.0"?>
297 <!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
298 <eLinkResult>
299 <LinkSet>
300 <DbFrom>protein</DbFrom>
301 <IdList>
302 <Id>4579714</Id>
303 </IdList>
304 <LinkSetDb>
305 <DbTo>protein</DbTo>
306 <LinkName>protein_protein</LinkName>
307 <Link>
308 <Id>4579714</Id>
309 <Score>2147483647</Score>
310 </Link>
311 <Link>
312 <Id>11277596</Id>
313 <Score>1279</Score>
314 </Link>
315 >>>
316
317 Let's get all the <Id> fields. (While the following isn't a good way
318 to parse XML, it is easy to understand and works well enough for this
319 example.) Note that I remove the first <Id> because that's from the
320 query and not from the results.
321
322 >>> import re
323 >>> ids = re.findall(r"<Id>(\d+)</Id>", results)
324 >>> ids = ids[1:]
325 >>> len(ids)
326 222
327 >>> dbids = EUtils.DBIds("protein", ids)
328 >>>
329
330 That's a lot of records. I could use 'efetch_using_dbids' but there's
331 a problem with that. Efetch uses the HTTP GET protocol to pass
332 information to the EUtils server. ("GET" is what's used when you type
333 a URL in the browser.) Each id takes about 9 characters, so the URL
334 would be over 2,000 characters long. This may not work on some
335 systems, for example, some proxies do not support long URLs. (Search
336 for "very long URLs" for examples.)
337
338 Instead, we'll upload the list to the server then fetch the FASTA
339 version using the history.
340
341 The first step is to upload the data. We want to put that into the
342 history so we set 'usehistory' to true. There's no existing history
343 so the webenv string is None.
344
345
346 >>> print eutils.epost(dbids, usehistory = 1, webenv = None).read()
347 <?xml version="1.0"?>
348 <!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd">
349 <ePostResult>
350 <QueryKey>1</QueryKey>
351 <WebEnv>%7BPgTHRHFBsJfC%3C%5C%5C%5B%3EAfJCKQ%5Ey%60%3CGkH%5DH%5E%3DJHGBKAJ%3F%40CbCiG%3FE%3C</WebEnv>
352 </ePostResult>
353
354 >>>
355
356 This says that the identifiers were saved as query #1, which will be
357 used later on as the "query_key" field. The WebEnv is a cookie (or
358 token) used to tell the server where to find that query. The WebEnv
359 changes after every history-enabled ESearch or EPost so you'll need to
360 parse the output from those to get the new WebEnv field. You'll also
361 need to unquote it since it is URL-escaped.
362
363 Also, you will need to pass in the name of the database used for the
364 query in order to access the history. Why? I don't know -- I figure
365 the WebEnv and query_key should be enough to get the database name.
366
367 >>> import urllib
368 >>> webenv = urllib.unquote("%7BPgTHRHFBsJfC%3C%5C%5C%5B%3EAfJCKQ%5Ey%60%3CGkH%5DH%5E%3DJHGBKAJ%3F%40CbCiG%3FE%3C")
369 >>> print webenv
370 {PgTHRHFBsJfC<\\[>AfJCKQ^y`<GkH]H^=JHGBKAJ?@CbCiG?E<
371 >>>
372
373 Okay, now to get the data in FASTA format. Notice that I need the
374 'retmax' in order to include all the records in the result. (The
375 default is 20 records.)
376
377 >>> fasta = eutils.efetch_using_history("protein", webenv, query_key = "1",
378 ... retmode = "text", rettype = "fasta",
379 ... retmax = len(dbids)).read()
380 >>> fasta.count(">")
381 222
382 >>> print fasta[:694]
383 >gi|14194475|sp|O93742|BACH_HALSD Halorhodopsin (HR)
384 MMETAADALASGTVPLEMTQTQIFEAIQGDTLLASSLWINIALAGLSILLFVYMGRNLEDPRAQLIFVAT
385 LMVPLVSISSYTGLVSGLTVSFLEMPAGHALAGQEVLTPWGRYLTWALSTPMILVALGLLAGSNATKLFT
386 AVTADIGMCVTGLAAALTTSSYLLRWVWYVISCAFFVVVLYVLLAEWAEDAEVAGTAEIFNTLKLLTVVL
387 WLGYPIFWALGAEGLAVLDVAVTSWAYSGMDIVAKYLFAFLLLRWVVDNERTVAGMAAGLGAPLARCAPA
388 DD
389 >gi|14194474|sp|O93741|BACH_HALS4 Halorhodopsin (HR)
390 MRSRTYHDQSVCGPYGSQRTDCDRDTDAGSDTDVHGAQVATQIRTDTLLHSSLWVNIALAGLSILVFLYM
391 ARTVRANRARLIVGATLMIPLVSLSSYLGLVTGLTAGPIEMPAAHALAGEDVLSQWGRYLTWTLSTPMIL
392 LALGWLAEVDTADLFVVIAADIGMCLTGLAAALTTSSYAFRWAFYLVSTAFFVVVLYALLAKWPTNAEAA
393 GTGDIFGTLRWLTVILWLGYPILWALGVEGFALVDSVGLTSWGYSLLDIGAKYLFAALLLRWVANNERTI
394 AVGQRSGRGAIGDPVED
395 >>>
396
397 To round things out, here's a query which refines the previous query.
398 I want to get all records from the first search which also have the
399 word "Structure" in them. (My background was originally structural
400 biophysics, whaddya expect? :)
401
402 >>> print eutils.search("#1 AND structure", db = "protein", usehistory = 1,
403 ... webenv = webenv).read()
404 <?xml version="1.0"?>
405 <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
406 <eSearchResult>
407 <Count>67</Count>
408 <RetMax>20</RetMax>
409 <RetStart>0</RetStart>
410 <QueryKey>2</QueryKey>
411 <WebEnv>UdvMf%3F%60G%3DIE%60bG%3DGec%3E%3D%3Cbc_%5DgBAf%3EAi_e%5EAJcHgDi%3CIqGdE%7BmC%3C</WebEnv>
412 <IdList>
413 <Id>461608</Id>
414 <Id>114808</Id>
415 <Id>1364150</Id>
416 <Id>1363466</Id>
417 <Id>1083906</Id>
418 <Id>99232</Id>
419 <Id>99212</Id>
420 <Id>81076</Id>
421 <Id>114811</Id>
422 <Id>24158915</Id>
423 <Id>24158914</Id>
424 <Id>24158913</Id>
425 <Id>1168615</Id>
426 <Id>114812</Id>
427 <Id>114809</Id>
428 <Id>17942995</Id>
429 <Id>17942994</Id>
430 <Id>17942993</Id>
431 <Id>20151159</Id>
432 <Id>20150922</Id>
433 </IdList>
434 <TranslationSet>
435 </TranslationSet>
436 <TranslationStack>
437 <TermSet>
438 <Term>#1</Term>
439 <Field>All Fields</Field>
440 <Count>222</Count>
441 <Explode>Y</Explode>
442 </TermSet>
443 <TermSet>
444 <Term>structure[All Fields]</Term>
445 <Field>All Fields</Field>
446 <Count>142002</Count>
447 <Explode>Y</Explode>
448 </TermSet>
449 <OP>AND</OP>
450 </TranslationStack>
451 </eSearchResult>
452
453 >>>
454
455 One last thing about history. It doesn't last very long -- perhaps an
456 hour or so. (Untested.) You may be able to toss it some keep-alive
457 signal every once in a while. Or you may want to keep
458
459 The known 'db' fields and primary IDs (if known) are
460 genome -- GI number
461 nucleotide -- GI number
462 omim -- MIM number
463 popset -- GI number
464 protein -- GI number
465 pubmed -- PMID
466 sequences (not available; this will combine all sequence databases)
467 structure -- MMDB ID
468 taxonomy -- TAXID
469
470 The 'field' parameter is different for different databases. The
471 fields for PubMed are listed at
472
473 http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#SearchFieldDescriptionsandTags
474
475 Affiliation -- AD
476 All Fields -- All
477 Author -- AU
478 EC/RN Number -- RN
479 Entrez Date -- EDAT (also valid for 'datetype')
480 Filter -- FILTER
481 Issue -- IP
482 Journal Title -- TA
483 Language -- LA
484 MeSH Date -- MHDA (also valid for 'datetype')
485 MeSH Major Topic -- MAJR
486 MeSH Subheadings -- SH
487 MeSH Terms -- MH
488 Pagination -- PG
489 Personal Name as Subject -- PS
490 Publication Date -- DP (also valid for 'datetype')
491 Publication Type -- PT
492 Secondary Source ID -- SI
493 Subset -- SB
494 Substance Name -- NM
495 Text Words -- TW
496 Title -- TI
497 Title/Abstract -- TIAB
498 Unique Identifiers -- UID
499 Volume -- VI
500
501 The fields marked as 'datetype' can also be used for date searches.
502 Date searches can be done in the query (for example, as
503
504 1990/01/01:1999/12/31[edat]
505
506 or by passing a WithinNDays or DateRange field to the 'date' parameter
507 of the search.
508
509
510 Please pay attention to the usage limits! The are listed at
511 http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
512
513 At the time of this writing they are:
514 * Run retrieval scripts on weekends or between 9 PM and 5 AM ET
515 weekdays for any series of more than 100 requests.
516 * Make no more than one request every 3 seconds.
517 * Only 5000 PubMed records may be retrieved in a single day.
518
519 * NCBI's Disclaimer and Copyright notice must be evident to users
520 of your service. NLM does not hold the copyright on the PubMed
521 abstracts the journal publishers do. NLM provides no legal
522 advice concerning distribution of copyrighted materials, consult
523 your legal counsel.
524
525 (Their disclaimer is at
526 http://www.ncbi.nlm.nih.gov/About/disclaimer.html )
527
528
529 """
530
531 import urllib, urllib2, cStringIO
532
533 DUMP_URL = 0
534 DUMP_RESULT = 0
535
536
537
538
539
540
541
542 TOOL = "EUtils_Python_client"
543 EMAIL = "biopython-dev@biopython.org"
544
545 assert " " not in TOOL
546 assert " " not in EMAIL
547
549 """Internal function: convert a list of ids to a comma-seperated string"""
550
551
552
553
554 if not dbids:
555 raise TypeError("dbids list must have at least one term")
556 for x in dbids.ids:
557 if "," in x:
558 raise TypeError("identifiers cannot contain a comma: %r " %
559 (x,))
560 id_string = ",".join(dbids.ids)
561 assert id_string.count(",") == len(dbids.ids)-1, "double checking"
562 return id_string
563
565 """Client-side interface to the EUtils services
566
567 See the module docstring for much more complete information.
568 """
569 - def __init__(self,
570 opener = None,
571 tool = TOOL,
572 email = EMAIL,
573 baseurl = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"):
574 """opener = None, tool = TOOL, email = EMAIL, baseurl = ".../eutils/"
575
576 'opener' -- an object which implements the 'open' method like a
577 urllib2.OpenDirector. Defaults to urllib2.build_opener()
578
579 'tool' -- the term to use for the 'tool' field, used by NCBI to
580 track which programs use their services. If you write your
581 own tool based on this package, use your own tool name.
582
583 'email' -- a way for NCBI to contact you (the developer, not
584 the user!) if there are problems and to tell you about
585 updates or changes to their system.
586
587 'baseurl' -- location of NCBI's EUtils directory. Shouldn't need
588 to change this at all.
589 """
590
591 if tool is not None and " " in tool:
592 raise TypeError("No spaces allowed in 'tool'")
593 if email is not None and " " in email:
594 raise TypeError("No spaces allowed in 'email'")
595
596 if opener is None:
597 opener = urllib2.build_opener()
598
599 self.opener = opener
600 self.tool = tool
601 self.email = email
602 self.baseurl = baseurl
603
605 """Internal function to add and remove fields from a query"""
606 q = query.copy()
607
608
609 q["tool"] = self.tool
610 q["email"] = self.email
611
612
613
614
615 if "usehistory" in q:
616 if q["usehistory"]:
617 q["usehistory"] = "y"
618 else:
619 q["usehistory"] = None
620
621
622
623 for k, v in q.items():
624 if v is None:
625 del q[k]
626
627
628 return urllib.urlencode(q)
629
630 - def _get(self, program, query):
631 """Internal function: send the query string to the program as GET"""
632
633
634 q = self._fixup_query(query)
635 url = self.baseurl + program + "?" + q
636 if DUMP_URL:
637 print "Opening with GET:", url
638 if DUMP_RESULT:
639 print " ================== Results ============= "
640 s = self.opener.open(url).read()
641 print s
642 print " ================== Finished ============ "
643 return cStringIO.StringIO(s)
644 return self.opener.open(url)
645
646 - def esearch(self,
647 term,
648 db = "pubmed",
649 field = None,
650 daterange = None,
651
652 retstart = 0,
653 retmax = 20,
654
655 usehistory = 0,
656 webenv = None,
657 ):
658
659 """term, db="pubmed", field=None, daterange=None, retstart=0, retmax=20, usehistory=0, webenv=none
660
661 Search the given database for records matching the query given
662 in the 'term'. See the module docstring for examples.
663
664 'term' -- the query string in the Entrez query language; see
665 http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html
666 'db' -- the database to search
667
668 'field' -- the field to use for unqualified words
669 Eg, "dalke[au] AND gene" with field==None becomes
670 dalke[au] AND (genes[MeSH Terms] OR gene[Text Word]
671 and "dalke[au] AND gene" with field=="au" becomes
672 dalke[au] AND genes[Author]
673 (Yes, I think the first "au" should be "Author" too)
674
675 'daterange' -- a date restriction; either WithinNDays or DateRange
676 'retstart' -- include identifiers in the output, starting with
677 position 'retstart' (normally starts with 0)
678 'retmax' -- return at most 'retmax' identifiers in the output
679 (if not specified, NCBI returns 20 identifiers)
680
681 'usehistory' -- flag to enable history tracking
682 'webenv' -- if this string is given, add the search results
683 to an existing history. (WARNING: the history disappers
684 after about an hour of non-use.)
685
686 You will need to parse the output XML to get the new QueryKey
687 and WebEnv fields.
688
689 Returns an input stream from an HTTP request. The stream
690 contents are in XML.
691 """
692 query = {"term": term,
693 "db": db,
694 "field": field,
695 "retstart": retstart,
696 "retmax": retmax,
697 "usehistory": usehistory,
698 "WebEnv": webenv,
699 }
700 if daterange is not None:
701 query.update(daterange.get_query_params())
702
703 return self._get(program = "esearch.fcgi", query = query)
704
705 - def epost(self,
706 dbids,
707
708 webenv = None,
709 ):
710 """dbids, webenv = None
711
712 Create a new collection in the history containing the given
713 list of identifiers for a database.
714
715 'dbids' -- a DBIds, which contains the database name and
716 a list of identifiers in that database
717 'webenv' -- if this string is given, add the collection
718 to an existing history. (WARNING: the history disappers
719 after about an hour of non-use.)
720
721 You will need to parse the output XML to get the new QueryKey
722 and WebEnv fields. NOTE: The order of the IDs on the server
723 is NOT NECESSARILY the same as the upload order.
724
725 Returns an input stream from an HTTP request. The stream
726 contents are in XML.
727 """
728 id_string = _dbids_to_id_string(dbids)
729
730
731 program = "epost.fcgi"
732 query = {"id": id_string,
733 "db": dbids.db,
734 "WebEnv": webenv,
735 }
736 q = self._fixup_query(query)
737
738
739
740 if DUMP_URL:
741 print "Opening with POST:", self.baseurl + program + "?" + q
742 if DUMP_RESULT:
743 print " ================== Results ============= "
744 s = self.opener.open(self.baseurl + program, q).read()
745 print s
746 print " ================== Finished ============ "
747 return cStringIO.StringIO(s)
748 return self.opener.open(self.baseurl + program, q)
749
750 - def esummary_using_history(self,
751 db,
752
753
754 webenv,
755 query_key,
756 retstart = 0,
757 retmax = 20,
758 retmode = "xml",
759 ):
760 """db, webenv, query_key, retstart = 0, retmax = 20, retmode = "xml"
761
762 Get the summary for a collection of records in the history
763
764 'db' -- the database containing the history/collection
765 'webenv' -- the WebEnv cookie for the history
766 'query_key' -- the collection in the history
767 'retstart' -- get the summaries starting with this position
768 'retmax' -- get at most this many summaries
769 'retmode' -- can only be 'xml'. (Are there others?)
770
771 Returns an input stream from an HTTP request. The stream
772 contents are in 'retmode' format.
773 """
774 return self._get(program = "esummary.fcgi",
775 query = {"db": db,
776 "WebEnv": webenv,
777 "query_key": query_key,
778 "retstart": retstart,
779 "retmax": retmax,
780 "retmode": retmode,
781 })
782
787 """dbids, retmode = "xml"
788
789 Get the summary for records specified by identifier
790
791 'dbids' -- a DBIds containing the database name and list
792 of record identifiers
793 'retmode' -- can only be 'xml'
794
795 Returns an input stream from an HTTP request. The stream
796 contents are in 'retmode' format.
797 """
798
799 id_string = _dbids_to_id_string(dbids)
800 return self._get(program = "esummary.fcgi",
801 query = {"id": id_string,
802 "db": dbids.db,
803
804 "retmode": retmode,
805 })
806
807 - def efetch_using_history(self,
808 db,
809 webenv,
810 query_key,
811
812 retstart = 0,
813 retmax = 20,
814
815 retmode = None,
816 rettype = None,
817
818
819 seq_start = None,
820 seq_stop = None,
821 strand = None,
822 complexity = None,
823 ):
824 """db, webenv, query_key, retstart=0, retmax=20, retmode=None, rettype=None, seq_start=None, seq_stop=None, strand=None, complexity=None
825
826 Fetch information for a collection of records in the history,
827 in a variety of formats.
828
829 'db' -- the database containing the history/collection
830 'webenv' -- the WebEnv cookie for the history
831 'query_key' -- the collection in the history
832 'retstart' -- get the formatted data starting with this position
833 'retmax' -- get data for at most this many records
834
835 These options work for sequence databases
836
837 'seq_start' -- return the sequence starting at this position.
838 The first position is numbered 1
839 'seq_stop' -- return the sequence ending at this position
840 Includes the stop position, so seq_start = 1 and
841 seq_stop = 5 returns the first 5 bases/residues.
842 'strand' -- strand. Use EUtils.PLUS_STRAND (== 1) for plus
843 strand and EUtils.MINUS_STRAND (== 2) for negative
844 'complexity' -- regulates the level of display. Options are
845 0 - get the whole blob
846 1 - get the bioseq for gi of interest (default in Entrez)
847 2 - get the minimal bioseq-set containing the gi of interest
848 3 - get the minimal nuc-prot containing the gi of interest
849 4 - get the minimal pub-set containing the gi of interest
850
851 http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
852
853 The valid retmode and rettype values are
854
855 For publication databases (omim, pubmed, journals) the
856 retmodes are 'xml', 'asn.1', 'text', and 'html'.
857
858 If retmode == xml ---> XML (default)
859 if retmode == asn.1 ---> ASN.1
860
861 The following rettype values work for retmode == 'text'.
862
863 docsum ----> author / title / cite / PMID
864 brief ----> a one-liner up to about 66 chars
865 abstract ----> cite / title / author / dept /
866 full abstract / PMID
867 citation ----> cite / title / author / dept /
868 full abstract / MeSH terms /
869 substances / PMID
870 medline ----> full record in medline format
871 asn.1 ----> full record in one ASN.1 format
872 mlasn1 ----> full record in another ASN.1 format
873 uilist ----> list of uids, one per line
874 sgml ----> same as retmode="xml"
875
876 Sequence databases (genome, protein, nucleotide, popset)
877 also have retmode values of 'xml', 'asn.1', 'text', and
878 'html'.
879
880 If retmode == 'xml' ---> XML (default; only supports
881 rettype == 'native')
882 If retmode == 'asn.1' ---> ASN.1 text (only works for rettype
883 of 'native' and 'sequin')
884
885 The following work with a retmode of 'text' or 'html'
886
887 native ----> Default format for viewing sequences
888 fasta ----> FASTA view of a sequence
889 gb ----> GenBank view for sequences, constructed sequences
890 will be shown as contigs (by pointing to its parts).
891 Valid for nucleotides.
892 gbwithparts --> GenBank view for sequences, the sequence will
893 always be shown. Valid for nucleotides.
894 est ----> EST Report. Valid for sequences from
895 dbEST database.
896 gss ----> GSS Report. Valid for sequences from dbGSS
897 database.
898 gp ----> GenPept view. Valid for proteins.
899 seqid ----> To convert list of gis into list of seqids
900 acc ----> To convert list of gis into list of accessions
901
902 # XXX TRY THESE
903 fasta_xml
904 gb_xml
905 gi (same as uilist?)
906
907
908
909 A retmode of 'file' is the same as 'text' except the data is
910 sent with a Content-Type of application/octet-stream, which tells
911 the browser to save the data to a file.
912
913 A retmode of 'html' is the same as 'text' except a HTML header
914 and footer are added and special character are properly escaped.
915
916 Returns an input stream from an HTTP request. The stream
917 contents are in the requested format.
918 """
919
920
921
922
923
924
925
926
927
928
929
930 if retstart == 0 and retmax > 500:
931 retmax = None
932 return self._get(program = "efetch.fcgi",
933 query = {"db": db,
934 "WebEnv": webenv,
935 "query_key": query_key,
936 "retstart": retstart,
937 "retmax": retmax,
938 "retmode": retmode,
939 "rettype": rettype,
940 "seq_start": seq_start,
941 "seq_stop": seq_stop,
942 "strand": strand,
943 "complexity": complexity,
944 })
945
946 - def efetch_using_dbids(self,
947 dbids,
948 retmode = None,
949 rettype = None,
950
951
952 seq_start = None,
953 seq_stop = None,
954 strand = None,
955 complexity = None,
956 ):
957 """dbids, retmode = None, rettype = None, seq_start = None, seq_stop = None, strand = None, complexity = None
958
959 Fetch information for records specified by identifier
960
961 'dbids' -- a DBIds containing the database name and list
962 of record identifiers
963 'retmode' -- See the docstring for 'efetch_using_history'
964 'rettype' -- See the docstring for 'efetch_using_history'
965
966 These options work for sequence databases
967
968 'seq_start' -- return the sequence starting at this position.
969 The first position is numbered 1
970 'seq_stop' -- return the sequence ending at this position
971 Includes the stop position, so seq_start = 1 and
972 seq_stop = 5 returns the first 5 bases/residues.
973 'strand' -- strand. Use EUtils.PLUS_STRAND (== 1) for plus
974 strand and EUtils.MINUS_STRAND (== 2) for negative
975 'complexity' -- regulates the level of display. Options are
976 0 - get the whole blob
977 1 - get the bioseq for gi of interest (default in Entrez)
978 2 - get the minimal bioseq-set containing the gi of interest
979 3 - get the minimal nuc-prot containing the gi of interest
980 4 - get the minimal pub-set containing the gi of interest
981
982 Returns an input stream from an HTTP request. The stream
983 contents are in the requested format.
984 """
985 id_string = _dbids_to_id_string(dbids)
986 return self._get(program = "efetch.fcgi",
987 query = {"id": id_string,
988 "db": dbids.db,
989
990 "retmode": retmode,
991 "rettype": rettype,
992 "seq_start": seq_start,
993 "seq_stop": seq_stop,
994 "strand": strand,
995 "complexity": complexity,
996 })
997
998 - def elink_using_history(self,
999 dbfrom,
1000 webenv,
1001 query_key,
1002
1003 db = "pubmed",
1004
1005 retstart = 0,
1006 retmax = 20,
1007
1008 cmd = "neighbor",
1009 retmode = None,
1010
1011 term = None,
1012 field = None,
1013
1014 daterange = None,
1015 ):
1016 """dbfrom, webenv, query_key, db="pubmed", retstart=0, retmax=20, cmd="neighbor", retmode=None, term=None, field=None, daterange=None,
1017
1018 Find records related (in various ways) to a collection of
1019 records in the history.
1020
1021 'dbfrom' -- this is the name of the database containing the
1022 collection of record. NOTE! For the other methods
1023 this is named 'db'. But I'm keeping NCBI's notation.
1024 This is where the records come FROM.
1025 'webenv' -- the WebEnv cookie for the history
1026 'query_key' -- the collection in the history
1027
1028 'db' -- Where the records link TO. This is where you want to
1029 find the new records. For example, if you want to
1030 find PubMed records related to a protein then 'dbfrom'
1031 is 'protein' and 'db' is 'pubmed'
1032
1033 'cmd'-- one of the following (unless specified, retmode is the
1034 default value, which returns data in XML)
1035 neighbor: Display neighbors and their scores by database and ID.
1036 (This is the default 'cmd'.)
1037 prlinks: List the hyperlink to the primary LinkOut provider
1038 for multiple IDs and database.
1039 When retmode == 'ref' this URL redirects the browser
1040 to the primary LinkOut provider for a single ID
1041 and database.
1042 llinks: List LinkOut URLs and Attributes for multiple IDs
1043 and database.
1044 lcheck: Check for the existence (Y or N) of an external
1045 link in for multiple IDs and database.
1046 ncheck: Check for the existence of a neighbor link for
1047 each ID, e.g., Related Articles in PubMed.
1048
1049 'retstart' -- get the formatted data starting with this position
1050 'retmax' -- get data for at most this many records
1051
1052 'retmode' -- only used with 'prlinks'
1053
1054 'term' -- restrict results to records which also match this
1055 Entrez search
1056 'field' -- the field to use for unqualified words
1057
1058 'daterange' -- restrict results to records which also match this
1059 date criteria; either WithinNDays or DateRange
1060 NOTE: DateRange must have both mindate and maxdate
1061
1062 Some examples:
1063 In PubMed, to get a list of "Related Articles"
1064 dbfrom = pubmed
1065 cmd = neighbor
1066
1067 To get MEDLINE index only related article
1068 dbfrom = pubmed
1069 db = pubmed
1070 term = medline[sb]
1071 cmd = neighbor
1072
1073 Given a PubMed record, find the related nucleotide records
1074 dbfrom = pubmed
1075 db = nucleotide (or "protein" for related protein records)
1076 cmd = neighbor
1077
1078 To get "LinkOuts" (external links) for a PubMed record set
1079 dbfrom = pubmed
1080 cmd = llinks
1081
1082 Get the primary link information for a PubMed document; includes
1083 various hyperlinks, image URL for the provider, etc.
1084 dbfrom = pubmed
1085 cmd = prlinks
1086 (optional) retmode = "ref" (causes a redirect to the privder)
1087
1088 Returns an input stream from an HTTP request. The stream
1089 contents are in XML unless 'retmode' is 'ref'.
1090 """
1091 query = {"WebEnv": webenv,
1092 "query_key": query_key,
1093 "db": db,
1094 "dbfrom": dbfrom,
1095 "cmd": cmd,
1096 "retstart": retstart,
1097 "retmax": retmax,
1098 "retmode": retmode,
1099 "term": term,
1100 "field": field,
1101 }
1102 if daterange is not None:
1103 if daterange.mindate is None or daterange.maxdate is None:
1104 raise TypeError("Both mindate and maxdate must be set for eLink")
1105 query.update(daterange.get_query_params())
1106 return self._get(program = "elink.fcgi", query = query)
1107
1108 - def elink_using_dbids(self,
1109 dbids,
1110 db = "pubmed",
1111
1112 cmd = "neighbor",
1113
1114 retmode = None,
1115 term = None,
1116 field = None,
1117
1118 daterange = None,
1119
1120 ):
1121 """dbids, db="pubmed", cmd="neighbor", retmode=None, term=None, daterange=None
1122
1123 Find records related (in various ways) to a set of records
1124 specified by identifier.
1125
1126 'dbids' -- a DBIds containing the database name and list
1127 of record identifiers
1128 'db' -- Where the records link TO. This is where you want to
1129 find the new records. For example, if you want to
1130 find PubMed records related to a protein then 'db'
1131 is 'pubmed'. (The database they are from is part
1132 of the DBIds object.)
1133
1134 'cmd' -- see the docstring for 'elink_using_history'
1135 'retmode' -- see 'elink_using_history'
1136 'term' -- see 'elink_using_history'
1137 'daterange' -- see 'elink_using_history'
1138
1139 Returns an input stream from an HTTP request. The stream
1140 contents are in XML unless 'retmode' is 'ref'.
1141 """
1142 id_string = _dbids_to_id_string(dbids)
1143 query = {"id": id_string,
1144 "db": db,
1145 "dbfrom": dbids.db,
1146 "cmd": cmd,
1147 "retmode": retmode,
1148 "field" : field,
1149 "term": term,
1150 }
1151 if daterange is not None:
1152 import Datatypes
1153 if isinstance(daterange, Datatypes.DateRange) and \
1154 (daterange.mindate is None or daterange.maxdate is None):
1155 raise TypeError("Both mindate and maxdate must be set for eLink")
1156 query.update(daterange.get_query_params())
1157
1158 return self._get(program = "elink.fcgi", query = query)
1159