The \eslmod{sqio} module contains routines for input from unaligned
sequence data files, such as FASTA files.

Several different common sequence file formats are understood, and can
be automatically recognized.

Sequences can be read sequentially from multiple sequence alignment
files, as if the MSA file were an unaligned sequence file, when the
module is augmented with the \eslmod{msa} module.

Sequences can be read from normal files, directly from the
\ccode{stdin} pipe, or from \ccode{gzip}-compressed files.

Sequence files can be automatically looked for in a list of one or
more database directories, specified by an environment variable (such
as \ccode{BLASTDB}).

Table~\ref{tbl:sqio_api} lists the functions in the \eslmod{sqio} API.
The module uses an \ccode{ESL\_SQFILE} object which works much like an
ANSI C \ccode{FILE}, maintaining information for an open sequence file
while it's being read.

% Table generated by autodoc -t esl_sqio.c (so don't edit here, edit esl_sqio.c:)
\begin{table}[hbp]
\begin{center}
{\small
\begin{tabular}{|ll|}\hline
\apisubhead{The \ccode{ESL\_SQFILE} object.}\\
\hyperlink{func:esl_sqfile_Open()}{\ccode{esl\_sqfile\_Open()}} & Description.\\
\hyperlink{func:esl_sqfile_Close()}{\ccode{esl\_sqfile\_Close()}} & Description.\\
\apisubhead{Sequence input/output}\\
\hyperlink{func:esl_sqio_Read()}{\ccode{esl\_sqio\_Read()}} & Description.\\
\hyperlink{func:esl_sqio_Write()}{\ccode{esl\_sqio\_Write()}} & Description.\\
\hyperlink{func:esl_sqio_Echo()}{\ccode{esl\_sqio\_Echo()}} & Echo the next sequence record onto output stream\\
\hyperlink{func:esl_sqio_WhatFormat()}{\ccode{esl\_sqio\_WhatFormat()}} & Description.\\
\hyperlink{func:esl_sqio_EncodeFormat()}{\ccode{esl\_sqio\_EncodeFormat()}} & Description.\\
\hyperlink{func:esl_sqio_DecodeFormat()}{\ccode{esl\_sqio\_DecodeFormat()}} & Returns descriptive string for file format code.\\
\hyperlink{func:esl_sqio_IsAlignment()}{\ccode{esl\_sqio\_IsAlignment()}} & Description.\\
\hyperlink{func:esl_sqio_Position()}{\ccode{esl\_sqio\_Position()}} & Description.\\
\hyperlink{func:esl_sqio_Rewind()}{\ccode{esl\_sqio\_Rewind()}} & Description.\\
\hyperlink{func:esl_sqfile_GuessAlphabet()}{\ccode{esl\_sqfile\_GuessAlphabet()}} & Guess the alphabet of an open \ccode{ESL\_SQFILE}\\
\apisubhead{Fast random access in a seqfile  [with SSI augmentation]}\\
\hyperlink{func:esl_sqfile_OpenSSI()}{\ccode{esl\_sqfile\_OpenSSI()}} & Opens an SSI index associated with a seq file.\\
\hyperlink{func:esl_sqfile_PositionByKey()}{\ccode{esl\_sqfile\_PositionByKey()}} & Use SSI to reposition seq file to a particular sequence.\\
\hyperlink{func:esl_sqfile_PositionByNumber()}{\ccode{esl\_sqfile\_PositionByNumber()}} & Use SSI to reposition by sequence number\\
\hline
\end{tabular}
}
\end{center}
\caption{The \eslmod{sqio} API.}
\label{tbl:sqio_api}
\end{table}

\subsection{Example: reading sequences from a file}

Figure~\ref{fig:sqio_example} shows a program that opens a file, reads
sequences from it one at a time, then closes the file.

\begin{figure}
\input{cexcerpts/sqio_example}
\caption{Example of reading sequences from a file.}
\label{fig:sqio_example}
\end{figure}

A FASTA file named \ccode{seqfile} is opened for reading by calling
\ccode{esl\_sqfile\_Open(seqfile, format, env, \&sqfp)}, which
creates a new \ccode{ESL\_SQFILE} and returns it through the
\ccode{sqfp} pointer. If \ccode{format} is passed as
\ccode{eslSQFILE\_UNKNOWN}, the format of the file is
autodetected; here, we bypass autodetection and assert that the file
is in FASTA format by passing the \ccode{eslSQFILE\_FASTA} code. (See
below for a list of valid codes and formats.) The optional \ccode{env}
argument is described below too; here, we're passing \ccode{NULL} and
not using it.
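
The directory lookup that an \ccode{env} argument asks for can be
pictured with the following sketch. It is plain C written for
illustration, not Easel's implementation: the helper name
\ccode{find\_in\_dirlist()} is hypothetical, and the colon-separated
list syntax is an assumption about how the environment variable is
laid out.

\begin{verbatim}
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Easel): look for <name> in each
 * directory of a colon-separated list, such as the value of BLASTDB.
 * On success, writes the full path into <path> (of size <n>) and
 * returns 1. Returns 0 if no listed directory holds a readable <name>.
 */
int
find_in_dirlist(const char *dirlist, const char *name, char *path, size_t n)
{
  const char *s = dirlist;

  while (s != NULL && *s != '\0')
    {
      const char *end = strchr(s, ':');
      size_t      len = (end != NULL) ? (size_t) (end - s) : strlen(s);
      FILE       *fp;

      /* Skip empty entries and entries that would overflow <path>. */
      if (len > 0 && len + 1 + strlen(name) + 1 <= n)
        {
          snprintf(path, n, "%.*s/%s", (int) len, s, name);
          if ((fp = fopen(path, "r")) != NULL) { fclose(fp); return 1; }
        }
      s = (end != NULL) ? end + 1 : NULL;
    }
  return 0;
}
\end{verbatim}

A caller might pass \ccode{getenv("BLASTDB")} as \ccode{dirlist}; a
\ccode{NULL} list (the variable is not set) simply yields no match.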

Several things can go wrong in trying to open a sequence file that are
beyond the control of Easel or your application, so it's important
that you check the return code.  \ccode{esl\_sqfile\_Open()} returns
\ccode{eslENOTFOUND} if the file can't be opened; \ccode{eslEFORMAT}
if the file is empty, or if autodetection can't determine its format;
and \ccode{eslEINVAL} if you try to autodetect format on an input
stream that can't be autodetected (a nonrewindable stream: see below
for info about reading from \ccode{stdin} and compressed
files). (Additionally, an internal error might be thrown, which you
should check for if you installed a nonfatal error handler.)

The file is then read one sequence at a time by calling
\ccode{esl\_sqio\_Read(sqfp, sq)}. This function returns \ccode{eslOK}
if it read a new sequence, and leaves that sequence in the \ccode{sq}
object that the caller provided.  When there is no more data in the
file, \ccode{esl\_sqio\_Read()} returns \ccode{eslEOF}.

If at any point the file does not appear to be in the proper format,
\ccode{esl\_sqio\_Read()} returns \ccode{eslEFORMAT}. The application
must check for this. The API provides a little information about what
went wrong and where. \ccode{sqfp->filename} is the name of the file
that we were parsing (not necessarily the same as \ccode{seqfile};
\ccode{sqfp->filename} can be a full pathname if we used an
\ccode{env} argument to look for \ccode{seqfile} in installed database
directories). \ccode{sqfp->linenumber} is the line number that we
failed at. \ccode{sqfp->errbuf} is a brief explanatory message that
gets filled in when an \ccode{eslEFORMAT} error occurs.
  \footnote{Unlike in the MSA module, you don't get access to the
  current line text; some of sqio's parsers use fast block-based
  (\ccode{fread()}) input instead of line-based input.}
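
The three outcomes described above (a record was read, end of file,
or a format error with a line number and a message) can be
illustrated with a toy FASTA reader. Everything in this sketch is
hypothetical and written only for illustration: the \ccode{TOY\_}
names, the fixed buffer sizes, and the line-based parsing are
assumptions, and none of it is Easel's parser.

\begin{verbatim}
#include <stdio.h>
#include <string.h>

/* Stand-ins for eslOK, eslEOF and eslEFORMAT; values are arbitrary. */
enum { TOY_OK = 0, TOY_EOF = 1, TOY_EFORMAT = 2 };

typedef struct {
  FILE *fp;
  int   linenumber;   /* number of the last line read; 0 before reading */
  char  errbuf[128];  /* filled in when TOY_EFORMAT is returned         */
  char  line[256];    /* one line of lookahead (assumes short lines)    */
  int   have_line;    /* 1 if <line> holds an unconsumed line           */
} TOY_SQFILE;

typedef struct { char name[64]; char seq[1024]; } TOY_SQ;

/* Make sure <line> holds the next unconsumed line; return 0 at EOF. */
int
toy_next_line(TOY_SQFILE *sqfp)
{
  if (sqfp->have_line) return 1;
  if (fgets(sqfp->line, sizeof(sqfp->line), sqfp->fp) == NULL) return 0;
  sqfp->line[strcspn(sqfp->line, "\r\n")] = '\0';
  sqfp->linenumber++;
  sqfp->have_line = 1;
  return 1;
}

/* Read one record into <sq>. Returns TOY_OK, TOY_EOF or TOY_EFORMAT. */
int
toy_read(TOY_SQFILE *sqfp, TOY_SQ *sq)
{
  size_t n = 0;

  sq->name[0] = sq->seq[0] = '\0';

  /* Skip blank lines that precede the record. */
  while (toy_next_line(sqfp) && sqfp->line[0] == '\0') sqfp->have_line = 0;
  if (! sqfp->have_line) return TOY_EOF;

  if (sqfp->line[0] != '>' || sscanf(sqfp->line + 1, "%63s", sq->name) != 1)
    {
      snprintf(sqfp->errbuf, sizeof(sqfp->errbuf),
               "expected '>' followed by a sequence name");
      return TOY_EFORMAT;
    }
  sqfp->have_line = 0;

  /* Append sequence lines until the next header line or end of file. */
  while (toy_next_line(sqfp) && sqfp->line[0] != '>')
    {
      size_t len = strlen(sqfp->line);
      if (n + len >= sizeof(sq->seq))
        {
          snprintf(sqfp->errbuf, sizeof(sqfp->errbuf), "sequence too long");
          return TOY_EFORMAT;
        }
      memcpy(sq->seq + n, sqfp->line, len + 1);
      n += len;
      sqfp->have_line = 0;
    }
  return TOY_OK;
}
\end{verbatim}

A caller of such a reader would loop while the return value is
\ccode{TOY\_OK}, and on \ccode{TOY\_EFORMAT} report
\ccode{linenumber} and \ccode{errbuf}, in the same way that an
application reports \ccode{sqfp->linenumber} and
\ccode{sqfp->errbuf}.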

We can reuse the same \ccode{ESL\_SQ} object for subsequent sequences
by calling \ccode{esl\_sq\_Reuse()} on it when we're done with the
previous sequence. If we wanted to load a set of sequences, we'd
\ccode{\_Create()} an array of \ccode{ESL\_SQ} objects. 

Finally, to clean up properly, an \ccode{ESL\_SQ} that was created is
destroyed with \ccode{esl\_sq\_Destroy(sq)}, and an \ccode{ESL\_SQFILE}
is closed with \ccode{esl\_sqfile\_Close()}.

\subsection{Accepted formats}

Accepted unaligned sequence file formats (and their Easel format
codes) are:

\begin{tabular}{ll}
\ccode{eslSQFILE\_DDBJ}     & DDBJ flat text DNA database format \\
\ccode{eslSQFILE\_EMBL}     & EMBL flat text DNA database format \\
\ccode{eslSQFILE\_FASTA}    & FASTA format \\
\ccode{eslSQFILE\_GENBANK}  & GenBank flat text DNA database format \\
\ccode{eslSQFILE\_UNIPROT}  & UniProt flat text protein database format \\
\end{tabular}

Additionally, the code \ccode{eslSQFILE\_UNKNOWN} is recognized. It
tells \ccode{esl\_sqfile\_Open()} to perform format autodetection.

\subsection{Special input streams: stdin and compressed files}

There are two special cases for input files. 

The module can read sequence input from the \ccode{stdin} pipe. If the
\ccode{seqfile} argument is ``-'', \ccode{esl\_sqfile\_Open()} ``opens''
standard input (really, it just associates \ccode{stdin}, which is
always open, with the \ccode{ESL\_SQFILE}).

The module can read compressed sequence files. If the \ccode{seqfile}
argument to \ccode{esl\_sqfile\_Open()} ends in \ccode{.gz}, the file is
assumed to be compressed with \ccode{gzip}; instead of opening it
normally, \ccode{esl\_sqfile\_Open()} opens it as a pipe from
\ccode{gunzip -dc}. Your system must support pipes to use this feature:
specifically, it must provide the \ccode{popen()} function (POSIX.2
compliant operating systems do). The \ccode{configure} script
automatically checks for this and defines
\ccode{HAVE\_POPEN} appropriately. The user must also have
\ccode{gunzip} installed and in their \ccode{PATH}.
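
The dispatch on the file name can be sketched in a few lines of plain
C. This is an illustration under stated assumptions, not Easel's
code: the names \ccode{classify\_input()}, \ccode{open\_input()}, and
the \ccode{INPUT\_} codes are hypothetical, and the single-quote
shell quoting is a simplification chosen for the example.

\begin{verbatim}
#define _POSIX_C_SOURCE 200809L   /* for popen() */
#include <stdio.h>
#include <string.h>

enum input_kind { INPUT_FILE, INPUT_STDIN, INPUT_GZIP_PIPE };

/* Hypothetical helper: decide how a sequence file name is opened. */
enum input_kind
classify_input(const char *seqfile)
{
  size_t n = strlen(seqfile);

  if (strcmp(seqfile, "-") == 0)                     return INPUT_STDIN;
  if (n >= 3 && strcmp(seqfile + n - 3, ".gz") == 0) return INPUT_GZIP_PIPE;
  return INPUT_FILE;
}

/* Hypothetical helper: open <seqfile> according to its kind.
 * Returns NULL on failure. <*ret_kind> tells the caller how to close
 * the stream: pclose() for a pipe, fclose() for a file, nothing for
 * stdin.
 */
FILE *
open_input(const char *seqfile, enum input_kind *ret_kind)
{
  char cmd[1024];

  *ret_kind = classify_input(seqfile);
  switch (*ret_kind) {
  case INPUT_STDIN:
    return stdin;               /* already open; just hand it back */
  case INPUT_GZIP_PIPE:
    /* The name is passed through a shell, so refuse names that would
     * break out of the single quotes used below. */
    if (strchr(seqfile, '\'') != NULL) return NULL;
    if (snprintf(cmd, sizeof(cmd), "gunzip -dc '%s'", seqfile)
        >= (int) sizeof(cmd)) return NULL;
    return popen(cmd, "r");
  default:
    return fopen(seqfile, "r");
  }
}
\end{verbatim}

Because the stream returned for a \ccode{.gz} file is a pipe, the
caller has to remember the kind and close it with \ccode{pclose()}
rather than \ccode{fclose()}.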

For both special cases, the catch is that you can't use format
autodetection; you must provide a valid known format code when you
read from stdin or from a compressed file. Pipes are not rewindable,
and format autodetection relies on a two-pass algorithm: it reads
partway into the file to determine the format, then rewinds to start
parsing for real.
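
Why a rewindable stream is needed can be seen in a minimal two-pass
sketch. The function name \ccode{guess\_format()}, the \ccode{FMT\_}
codes, and the first-line signatures it checks are assumptions made
for illustration; Easel's actual detection rules may differ.

\begin{verbatim}
#include <stdio.h>
#include <string.h>

enum { FMT_NOREWIND = -1, FMT_UNKNOWN = 0,
       FMT_FASTA, FMT_EMBL_LIKE, FMT_GENBANK_LIKE };

/* Hypothetical sketch: peek at the first nonblank line to guess the
 * format, then rewind so that parsing can start from the beginning.
 * Returns FMT_NOREWIND if the stream cannot be repositioned, which is
 * the case for pipes.
 */
int
guess_format(FILE *fp)
{
  char line[256];
  int  fmt   = FMT_UNKNOWN;
  long start = ftell(fp);

  if (start < 0) return FMT_NOREWIND;     /* not seekable */

  /* Pass 1: read ahead until the first line that is not blank. */
  while (fgets(line, sizeof(line), fp) != NULL)
    {
      if (line[strspn(line, " \t\r\n")] == '\0') continue;
      if      (line[0] == '>')                 fmt = FMT_FASTA;
      else if (strncmp(line, "ID   ", 5) == 0) fmt = FMT_EMBL_LIKE;
      else if (strncmp(line, "LOCUS", 5) == 0) fmt = FMT_GENBANK_LIKE;
      break;
    }

  /* Rewind, so that pass 2 (the real parse) sees the whole file. */
  if (fseek(fp, start, SEEK_SET) != 0) return FMT_NOREWIND;
  return fmt;
}
\end{verbatim}

On a pipe, \ccode{ftell()} fails, so the sketch gives up before
consuming any input; this corresponds to the situation in which the
caller must supply a known format code instead.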

\subsection{Augmentations}

The sqio module is optionally augmented by up to two additional
modules, as follows:

\subsubsection{msa: read unaligned sequences sequentially from an alignment}

If sqio is augmented with the msa module, then the sqio API gains the
ability to read alignment file formats in addition to unaligned file
formats. The sqio API remains exactly the same (the caller doesn't
have to use any msa module functions).

\subsubsection{alphabet: digitized sequences}

At present, only placeholders exist in the code for this augmentation.
The plan is to provide the ability to input sequences directly into
\ccode{dsq} as pre-digitized sequences.






