Package Bio :: Package SeqIO
[hide private]
[frames] | no frames]

Package SeqIO

source code

Sequence input/output designed to look similar to the bioperl design.

Input
=====
The main function is Bio.SeqIO.parse(...) which takes an input file handle,
and format string.  This returns an iterator giving SeqRecord objects.

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    for record in SeqIO.parse(handle, "fasta") :
        print record

Note that the parse() function will all invoke the relevant parser for
the format with its default settings.  You may want more control, in which case
you need to create a format specific sequence iterator directly.

For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records
using a sequence iterator can save you a lot of memory (RAM).  There is less
benefit for interlaced file formats (e.g. most multiple alignment file formats).
However, an iterator only lets you access the records one by one.

If you want random access to the records by number, turn this into a list:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    records = list(SeqIO.parse(handle, "fasta"))
    print records[0]

If you want random access to the records by a key such as the record id, turn
the iterator into a dictionary:

    from Bio import SeqIO
    handle = open("example.fasta", "rU")
    record_dict = SeqIO.to_dict(SeqIO.parse(handle, "format"))
    print record["gi:12345678"]


Input - Alignments
==================
Currently an alignment class cannot be created from SeqRecord objects.
Instead, use the to_alignment(...) function, like so:

    from Bio import SeqIO
    handle = open("example.aln", "rU")
    alignment = SeqIO.to_alignment(SeqIO.parse(handle, "clustal"))

This function may be removed in future once alignments can be created
directly from SeqRecord objects.

Output
======
Use the function Bio.SeqIO.write(...), which takes a complete set of SeqRecord
objects (either as a list, or an iterator), an output file handle and of course
the file format.

    from Bio import SeqIO
    records = ...
    handle = open("example.faa", "w")
    SeqIO.write(records, handle, "fasta")
    handle.close()

In general, you are expected to call this function once (with all your records)
and then close the file handle.

Output - Advanced
=================
The effect of calling write() multiple times on a single file will vary
depending on the file format, and is best avoided unless you have a strong reason
to do so.

Trying this for certain alignment formats (e.g. phylip, clustal, stockholm) would
have the effect of concatenating several multiple sequence alignments together.
Such files are created by the PHYLIP suite of programs for bootstrap analysis.

For sequential files formats (e.g. fasta, genbank) each "record block" holds a
single sequence.  For these files it would probably be safe to call write()
multiple times.

File Formats
============
When specifying formats, use lowercase strings.

Old Files
=========
The modules Bio.SeqIO.FASTA and Bio.SeqIO.generic are depreciated and may be
removed.

Submodules [hide private]

Functions [hide private]
 
write(sequences, handle, format)
Write complete set of sequences to a file
source code
 
parse(handle, format)
Turns a sequence file into a iterator returning SeqRecords
source code
 
to_dict(sequences, key_function=None)
Turns a sequence iterator or list into a dictionary
source code
 
to_alignment(sequences, alphabet=Alphabet(), strict=True)
Returns a multiple sequence alignment
source code
Variables [hide private]
  _FormatToIterator = {"fasta": FastaIO.FastaIterator, "genbank"...
  _FormatToWriter = {"fasta": FastaIO.FastaWriter, "phylip": Phy...
Function Details [hide private]

write(sequences, handle, format)

source code 

Write complete set of sequences to a file

sequences - A list (or iterator) of SeqRecord objects handle - File handle object to write to format - What format to use.

You should close the handle after calling this function.

There is no return value.

parse(handle, format)

source code 

Turns a sequence file into a iterator returning SeqRecords

handle - handle to the file. format - string describing the file format.

If you have the file name in a string 'filename', use:

from Bio import SeqIO my_iterator = SeqIO.parse(open(filename,"rU"), format)

If you have a string 'data' containing the file contents, use:

from Bio import SeqIO from StringIO import StringIO my_iterator = SeqIO.parse(StringIO(data), format)

Note that file will be parsed with default settings, which may result in a generic alphabet or other non-ideal settings. For more control, you must use the format specific iterator directly...

to_dict(sequences, key_function=None)

source code 
Turns a sequence iterator or list into a dictionary

sequences  - An iterator that returns SeqRecord objects,
             or simply a list of SeqRecord objects.
key_function - Optional function which when given a SeqRecord
               returns a unique string for the dictionary key.

e.g. key_function = lambda rec : rec.name
or,  key_function = lambda rec : rec.description.split()[0]

If key_function is ommitted then record.id is used, on the
assumption that the records objects returned are SeqRecords
with a unique id field.

If there are duplicate keys, an error is raised.

Example usage:

from Bio import SeqIO
filename = "example.fasta"
d = SeqIO.to_dict(SeqIO.parse(open(faa_filename, "rU")),
    key_function = lambda rec : rec.description.split()[0])
print len(d)
print d.keys()[0:10]
key = d.keys()[0]
print d[key]

to_alignment(sequences, alphabet=Alphabet(), strict=True)

source code 
Returns a multiple sequence alignment

sequences -An iterator that returns SeqRecord objects,
           or simply a list of SeqRecord objects.
           All the record sequences must be the same length.
alphabet - Optional alphabet.  Stongly recommended.
strict   - Optional, defaults to True.  Should error checking
           be done?


Variables Details [hide private]

_FormatToIterator

Value:
{"fasta": FastaIO.FastaIterator, "genbank": InsdcIO.GenBankIterator, "\
genbank-cds": InsdcIO.GenBankCdsFeatureIterator, "embl": InsdcIO.EmblI\
terator, "embl-cds": InsdcIO.EmblCdsFeatureIterator, "clustal": Clusta\
lIO.ClustalIterator, "phylip": PhylipIO.PhylipIterator, "nexus": Nexus\
IO.NexusIterator, "stockholm": StockholmIO.StockholmIterator, "swiss":\
 SwissIO.SwissIterator,}

_FormatToWriter

Value:
{"fasta": FastaIO.FastaWriter, "phylip": PhylipIO.PhylipWriter, "stock\
holm": StockholmIO.StockholmWriter, "clustal": ClustalIO.ClustalWriter\
,}