2.3.1
Composite wells file format
The CWF file is a container format that stores multiple “streams” of information. The container itself is a “ZIP” file (http://www.pkware.com/documents/casestudies/APPNOTE.TXT) with a single-level hierarchy. Each stream is named and compressed separately, allowing rapid access to any information in the file. The CWF file format is inspired by the “OpenDocument format” described in ISO/IEC 26300:2006 (http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=43485). The OpenDocument format benefits from a separation of concerns, keeping the content, styles, metadata and application settings in four separate XML streams; the CWF format maintains a similar separation: flowgrams, called bases, metadata, processing history and run metrics are stored individually. The user should not normally need to unpack the CWF file, as each stream is read as needed. Nonetheless, for convenience, a C library (libcwf) is available that can read the CWF file format.
Table 3 shows an example listing of a CWF file’s streams that might exist at the end of signal processing. This file represents the data from one region of a high-quality 2-region sequencing Run (GS FLX+ System), and the Table shows the size savings provided by the CWF compressed format. Each stream is described separately below.
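Because the container is an ordinary ZIP archive, its streams can also be enumerated with any general-purpose ZIP reader. The following is a minimal Python sketch of that idea; it is not the libcwf API, and the function names list_streams and read_stream are illustrative assumptions.

```python
import io
import zipfile


def list_streams(cwf):
    """Return (name, uncompressed size, compressed size) for each stream.

    `cwf` may be a path or a binary file-like object holding a CWF (ZIP) file.
    """
    with zipfile.ZipFile(cwf) as archive:
        return [(info.filename, info.file_size, info.compress_size)
                for info in archive.infolist()]


def read_stream(cwf, name):
    """Read one named stream without unpacking the rest of the container."""
    with zipfile.ZipFile(cwf) as archive:
        return archive.read(name)


if __name__ == "__main__":
    # Build a tiny stand-in container in memory (hypothetical contents);
    # real CWF files are produced by the data processing software.
    demo = io.BytesIO()
    with zipfile.ZipFile(demo, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("meta.xml", "<Metadata/>")
        archive.writestr("filters.xml", "<Filters/>")
    for name, size, packed in list_streams(demo):
        print(f"{name}: {size} bytes ({packed} compressed)")
```

Comparing the two sizes reported for each stream gives the same kind of per-stream savings figure that a stream listing would show.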
2.3.1.1
mimetype
2.3.1.2
meta.xml
Original Run name. e.g. “R_2007_06_27_15_44_21_rig3_ccelone_1007075seqkit93555420PELTxxEX2xxVERIIF2”
<?xml version="1.0" encoding="UTF-8"?>
<Metadata xmlns:tns="http://purl.org/dc/terms/"
xmlns:tnsa="http://purl.org/dc/elements/1.1/"
xmlns:tnsb="http://purl.org/dc/dcmitype/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="GSDataProcessing-1.0.xsd">
<tnsa:title>R_2011_04_04_10_00_36_FLX08100646_Administrator_apr_2_build</tnsa:title>
<tnsa:type>flowgrams</tnsa:type>
<tnsa:creator>Administrator</tnsa:creator>
<tns:created>2011-04-04T10:00:36Z</tns:created>
<InstrumentSerialNumber>FLX08100646</InstrumentSerialNumber>
<InstrumentVersion>2.6 (20110324_1135)</InstrumentVersion>
<InstrumentModel>GSFLX</InstrumentModel>
<InstrumentConfiguration>FLX+</InstrumentConfiguration>
<Run>
<Name>apr_2_build</Name>
<RunId></RunId>
<Project>Applications II</Project>
<Kit>XL+KIT</Kit>
<Script>400x_TACG_70x75_XLPLUSKIT.icl</Script>
<RegionCount>2</RegionCount>
<RegionLayoutName>2_region</RegionLayoutName>
<PTP>
<ID>749573</ID>
<WellSize unit="um">35</WellSize>
<Size unit="mm">
<Width>70</Width>
<Height>75</Height>
</Size>
</PTP>
<Barcodes>
<Barcode>123456</Barcode>
</Barcodes>
<Flow>
<FlowCount>1603</FlowCount>
<CycleCount>400</CycleCount>
<ActualOrder>SSSSOOOOSOOOOSOOOOSOOOOSPSSSSTACG…TACGPS</ActualOrder>
<FlowOrder>PTACG…TACGP</FlowOrder>
</Flow>
<Images>
<ImageWidth>4096</ImageWidth>
<ImageHeight>4096</ImageHeight>
<DcOffset>495</DcOffset>
<MaxValue>16383</MaxValue>
</Images>
</Run>
<Region>
<Name>region 1</Name>
<Number>1</Number>
<TemplateBounds unit="pixel">
<Center>
<X>1024</X>
<Y>2048</Y>
</Center>
<Dimension>
<Width>2047</Width>
<Height>4095</Height>
</Dimension>
</TemplateBounds>
<RevisedBounds unit="pixel">
<Center>
<X>1024</X>
<Y>2048</Y>
</Center>
<Dimension>
<Width>2047</Width>
<Height>4095</Height>
</Dimension>
</RevisedBounds>
</Region>
<WellStatus>uncorrected</WellStatus>
<WellCount>1104008</WellCount>
</Metadata>
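The meta.xml example above mixes namespaced Dublin Core elements (the tnsa prefix) with unqualified elements. As a rough illustration of reading it, here is a Python sketch using the standard library XML parser. The function name summarize_meta and the choice of fields are assumptions made for this example; only the element names come from the listing.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "tnsa" prefix in the example listing.
DC = "{http://purl.org/dc/elements/1.1/}"


def summarize_meta(xml_data):
    """Extract a few fields from a meta.xml stream (bytes or str)."""
    root = ET.fromstring(xml_data)
    region = root.find("Region")
    bounds = region.find("RevisedBounds")
    return {
        "title": root.findtext(DC + "title"),
        "flow_count": int(root.findtext("Run/Flow/FlowCount")),
        "region_number": int(region.findtext("Number")),
        "center": (int(bounds.findtext("Center/X")),
                   int(bounds.findtext("Center/Y"))),
        "size": (int(bounds.findtext("Dimension/Width")),
                 int(bounds.findtext("Dimension/Height"))),
    }
```

Elements without a prefix are looked up by their plain names because the example declares no default namespace.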
2.3.1.3
history.xml
<?xml version="1.0" encoding="UTF-8"?>
<History xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="GSDataProcessing-1.0.xsd">
<Job>
<ID>cc8dfc88-70ff-11e0-99ce-00215e70718c</ID>
<Name>cwf_with_instrumentConfiguration_take3</Name>
<ProcessingDirectoryName>D_2011_04_27_14_54_37_FLX08100646_imageProcessingOnly_cwf_with_instrumentConfiguration_take3</ProcessingDirectoryName>
<OS>Linux</OS>
<StartTime>2011-04-27T18:54:38Z</StartTime>
<TotalJobSeconds>12435</TotalJobSeconds>
<PartialJobSeconds>11927</PartialJobSeconds>
<GsRunProcessorVersion>2.6</GsRunProcessorVersion>
<GsRunProcessorBuild>20110426_1019</GsRunProcessorBuild>
<Host>FLX08100646</Host>
<NumDataSetsInJob>2</NumDataSetsInJob>
<NumProcessors>4</NumProcessors>
<PartialNumProcessors>2</PartialNumProcessors>
<Type>imageProcessingOnly</Type>
<Pipeline>imageProcessingOnly</Pipeline>
<DisplayText>Image processing only</DisplayText>
<Description>
This is the default pipeline for taking the raw images and
making intermediate files that can be analyzed by
other pipelines.
</Description>
<CommandLine>
--run=/data/2011_04_04/R_2011_04_04_10_00_36_FLX08100646_Administrator_apr_2_build/D_2011_04_27_14_54_37_FLX08100646_imageProcessingOnly_cwf_with_instrumentConfiguration_take3/dataRunParams.xml
--imageLog=/data/2011_04_04/R_2011_04_04_10_00_36_FLX08100646_Administrator_apr_2_build/imageLog.parse
--images=/data/2011_04_04/R_2011_04_04_10_00_36_FLX08100646_Administrator_apr_2_build/rawImages/
--out=/data/2011_04_04/R_2011_04_04_10_00_36_FLX08100646_Administrator_apr_2_build/D_2011_04_27_14_54_37_FLX08100646_imageProcessingOnly_cwf_with_instrumentConfiguration_take3/regions
--log=/data/2011_04_04/R_2011_04_04_10_00_36_FLX08100646_Administrator_apr_2_build/D_2011_04_27_14_54_37_FLX08100646_imageProcessingOnly_cwf_with_instrumentConfiguration_take3/gsRunProcessor.log
--error=/data/2011_04_04/R_2011_04_04_10_00_36_FLX08100646_Administrator_apr_2_build/D_2011_04_27_14_54_37_FLX08100646_imageProcessingOnly_cwf_with_instrumentConfiguration_take3/gsRunProcessor_err.log
--job=cc8dfc88-70ff-11e0-99ce-00215e70718c
--pipe=/usr/local/rig/apps/gsRunProcessor/etc/gsRunProcessor/imageProcessingOnly.xml
--name=cwf_with_instrumentConfiguration_take3
--progress
--remoteProgress=localhost:4540
</CommandLine>
<ParamsUsed>
<CameraClassifier computeTime="6" name="CameraClassifier">
<enable>true</enable>
<removeHotPixels>true</removeHotPixels>
<imageScaleFactor>1</imageScaleFactor>
<hotPixelThreshold>500</hotPixelThreshold>
<computeSharpnessMask>true</computeSharpnessMask>
<minPPI>30</minPPI>
<standardDeviationLimit>2.2</standardDeviationLimit>
<sampleBlockSize>128</sampleBlockSize>
<fftRadius>45</fftRadius>
</CameraClassifier>
<WellFinder computeTime="398" name="WellFinder">
<enable>true</enable>
<imageScaleFactor>1</imageScaleFactor>
<kernelSize>51</kernelSize>
<upsampleHighDensityPtps>true</upsampleHighDensityPtps>
<minPPISignal>30</minPPISignal>
<minConsensusSignal>20</minConsensusSignal>
<minWellSpacing>3</minWellSpacing>
<secondSearchPass>false</secondSearchPass>
<blockSize>20000</blockSize>
<upsampleFactor>2</upsampleFactor>
<maskBeta>0.6</maskBeta>
<maskAlpha>0.1</maskAlpha>
<numPixelsPerWell>1</numPixelsPerWell>
<morphologyThresholdMultiplier>1</morphologyThresholdMultiplier>
<morphologyNumInARow>5</morphologyNumInARow>
</WellFinder>
<WellBuilder computeTime="11520" name="WellBuilder">
<enable>true</enable>
<kernelSize>51</kernelSize>
<minPPISignal>30</minPPISignal>
<scaleFactor>1</scaleFactor>
<useBicubic>true</useBicubic>
<usePpiInterpolation>false</usePpiInterpolation>
<firstFlowToInterpolate>20</firstFlowToInterpolate>
<imageScaleFactor>1</imageScaleFactor>
<skipBackgroundSubtraction>false</skipBackgroundSubtraction>
</WellBuilder>
<MetricsGenerator computeTime="1" name="MetricsGenerator">
<enable>true</enable>
</MetricsGenerator>
</ParamsUsed>
<FinishTime>2011-04-27T22:21:58Z</FinishTime>
<TotalUserSeconds>37614</TotalUserSeconds>
<TotalSystemSeconds>3974</TotalSystemSeconds>
</Job>
</History>
2.3.1.4
location.idx
This is a binary index to the wells files. It contains common data about each well that can be used to support a well browser-style application. Items in this stream are stored in Intel Little-Endian format (i.e. the rank is stored as Byte3 Byte2 Byte1 Byte0). The wells file contains one field per well, and each field is made up of the packed structure shown in Figure 10:
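As a small illustration of the byte order only, the Python sketch below decodes a little-endian unsigned integer from a byte buffer. The 4-byte width is an assumption suggested by the Byte3 to Byte0 notation; the actual field names, widths and order are those of the packed structure in Figure 10, which is not reproduced here. The helper name read_u32_le is hypothetical.

```python
import struct


def read_u32_le(buffer, offset=0):
    """Decode a 4-byte unsigned integer stored least significant byte first."""
    (value,) = struct.unpack_from("<I", buffer, offset)
    return value


if __name__ == "__main__":
    # Bytes as they appear in the stream: Byte0 (least significant) comes first.
    raw = bytes([0x78, 0x56, 0x34, 0x12])
    print(hex(read_u32_le(raw)))
```

Using an explicit "<" prefix in the format string keeps the decoding correct regardless of the byte order of the machine running the code.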
2.3.1.5
metrics.xml
<?xml version="1.0" encoding="utf-8"?>
<Metrics xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="GSDataProcessing-1.0.xsd">
<RunMetrics>
<MaxWellCount>529820</MaxWellCount>
<RawWellCount>468698</RawWellCount>
<SampleKeyPassWellCount>451747</SampleKeyPassWellCount>
<ControlKeyPassWellCount>7818</ControlKeyPassWellCount>
<ControlKeys>
<Key>ATGC</Key>
</ControlKeys>
<SampleKeys>
<Key>TCAG</Key>
</SampleKeys>
<Streams>
<Stream type="rawWellDensity">
<StreamName>rawWellDensity.pgm</StreamName>
<DataType>image</DataType>
</Stream>
<Stream type="carryForwardCorrections">
<StreamName>cfValues.double.dat</StreamName>
<DataType>double</DataType>
</Stream>
<Stream type="incompleteExtensionCorrections">
<StreamName>ieValues.double.dat</StreamName>
<DataType>double</DataType>
</Stream>
<Stream type="filterResults">
<StreamName>filterResults.uint8.dat</StreamName>
<DataType>byte</DataType>
</Stream>
<Stream type="signalPerBase">
<StreamName>signalPerBase.float.dat</StreamName>
<DataType>float</DataType>
</Stream>
</Streams>
<Other>
<NukeSignalStrengthBalancer>
<medianOneMerA>1.09085</medianOneMerA>
<medianOneMerT>0.909846</medianOneMerT>
<medianOneMerG>0.898174</medianOneMerG>
<medianOneMerC>0.894138</medianOneMerC>
</NukeSignalStrengthBalancer>
<BlowByCorrector>
<droopLambda>-0.00171434</droopLambda>
<MedianSignal>1375.71</MedianSignal>
<MaximumSignal>5086.11</MaximumSignal>
<MedianDensity>12</MedianDensity>
<MinimumDensity>1</MinimumDensity>
<MaximumDensity>19</MaximumDensity>
<num_low_density_low_signal_wells>14742</num_low_density_low_signal_wells>
<num_high_density_low_signal_wells>10481</num_high_density_low_signal_wells>
<num_low_density_high_signal_wells>13645</num_low_density_high_signal_wells>
<num_high_density_high_signal_wells>14899</num_high_density_high_signal_wells>
<mask_averaging_used>true</mask_averaging_used>
<FinalMask>
<class density="high" signal="high" class="0">
<epsilon>0.174537</epsilon>
<beta>0.964658</beta>
</class>
<class density="low" signal="high" class="1">
<epsilon>0.188713</epsilon>
<beta>0.988837</beta>
</class>
<class density="high" signal="low" class="2">
<epsilon>0.171184</epsilon>
<beta>0.950913</beta>
</class>
<class density="low" signal="low" class="3">
<epsilon>0.184087</epsilon>
<beta>0.900547</beta>
</class>
</FinalMask>
</BlowByCorrector>
<CafieCorrector>
<droopLambda>-0.00158241</droopLambda>
</CafieCorrector>
<NukeSignalStrengthBalancer>
<medianOneMerA>1.01745</medianOneMerA>
<medianOneMerT>0.986296</medianOneMerT>
<medianOneMerG>0.987408</medianOneMerG>
<medianOneMerC>0.980024</medianOneMerC>
</NukeSignalStrengthBalancer>
</Other>
</RunMetrics>
</Metrics>
2.3.1.6
sequences.xml
<?xml version="1.0" encoding="iso-8859-1"?>
<Sequences>
<Sequence Type="None">
<ID>0</ID>
<Name>unknown</Name>
<Key></Key>
<Seq></Seq>
</Sequence>
<Sequence Type="Control">
<ID>1</ID>
<Name>ATGC-control</Name>
<Key>ATGC</Key>
<Seq>ATGC</Seq>
</Sequence>
<Sequence Type="Library">
<ID>2</ID>
<Name>TCAG-key</Name>
<Key>TCAG</Key>
<Seq>TCAG</Seq>
</Sequence>
<Sequence Type="Control">
<ID>3</ID>
<Name>TF2LonG</Name>
<Key>ATGC</Key>
<Seq>ATGCCA...TGTGTG</Seq>
</Sequence>
<Sequence Type="Control">
<ID>4</ID>
<Name>TF7LonG</Name>
<Key>ATGC</Key>
<Seq>ATGC...TTCCTGTGTG</Seq>
</Sequence>
<Sequence Type="Control">
<ID>5</ID>
<Name>TF90LonG</Name>
<Key>ATGC</Key>
<Seq>ATGCCGCA...GTGTG</Seq>
</Sequence>
<Sequence Type="Control">
<ID>6</ID>
<Name>TF100LonG</Name>
<Key>ATGC</Key>
<Seq>ATGCAT...GTGTG</Seq>
</Sequence>
<Sequence Type="Control">
<ID>7</ID>
<Name>TF120LonG</Name>
<Key>ATGC</Key>
<Seq>ATGCA...CCTGTGTG</Seq>
</Sequence>
<Sequence Type="Control">
<ID>8</ID>
<Name>TF150MMP7A</Name>
<Key>ATGC</Key>
<Seq>ATGCGC...ATGG</Seq>
</Sequence>
</Sequences>
2.3.1.7
filters.xml
A list of filters referred to by the values in the “filterResults.uint8.dat” stream (see section 2.3.1.8, below). Note that the order of filters in this file is not guaranteed. It is also likely that the filters will be reorganized in a future release of the software to provide more detail. An example filters.xml file is shown in Figure 13.
<?xml version="1.0" encoding="utf-8"?>
<Filters xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="GSDataProcessing-1.0.xsd">
<Filter basic="true">
<ID>0</ID>
<Name>Pass</Name>
</Filter>
<Filter basic="true">
<ID>1</ID>
<Name>No Key</Name>
</Filter>
<Filter basic="true">
<ID>2</ID>
<Name>Bad Band</Name>
</Filter>
<Filter basic="true">
<ID>3</ID>
<Name>Trimmed Too Short Quality</Name>
</Filter>
<Filter basic="true">
<ID>4</ID>
<Name>Low Pass Filter</Name>
</Filter>
<Filter basic="true">
<ID>5</ID>
<Name>Classifier Filter</Name>
</Filter>
<Filter basic="true">
<ID>6</ID>
<Name>Dot Filter</Name>
</Filter>
<Filter basic="true">
<ID>7</ID>
<Name>Mixed Filter</Name>
</Filter>
<Filter basic="true">
<ID>8</ID>
<Name>Trimmed Too Short Primer</Name>
</Filter>
<Filter basic="true">
<ID>9</ID>
<Name>Low Quality</Name>
</Filter>
</Filters>
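Since the order of filters in filters.xml is not guaranteed, a reader should key its lookup on the ID element rather than on position. The Python sketch below shows one way to do that and to tally filter results. It assumes, for illustration only, that the filter results stream holds one unsigned byte per well; the function names filter_names and tally are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def filter_names(filters_xml):
    """Map filter ID -> name, keyed by <ID> because file order may vary."""
    root = ET.fromstring(filters_xml)
    return {int(f.findtext("ID")): f.findtext("Name")
            for f in root.iter("Filter")}


def tally(filter_results, names):
    """Count wells per filter name.

    Assumption: `filter_results` holds one unsigned byte per well.
    """
    counts = Counter(filter_results)  # iterating bytes yields ints 0..255
    return {names.get(fid, f"unknown ({fid})"): n
            for fid, n in counts.items()}
```

Unknown IDs are reported rather than dropped, which helps if a later software release reorganizes the filter list.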
2.3.1.8
filterResults.uint8.dat
The half-precision floating point format is a relatively new binary floating point format that uses 2 bytes. It is not covered by the original IEEE 754 standard for encoding floating point numbers, but it is included in the proposed IEEE 754r revision (http://www.validlab.com/754R/). The format uses 1 sign bit, a 5-bit excess-15 exponent and 10 mantissa bits (with an implied 1 bit), and follows all the standard IEEE rules. The minimum and maximum representable values are 2.98×10⁻⁸ and 65504, respectively. The libcwf library includes a conversion routine from half precision to full precision floating point.
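The bit layout described above (1 sign bit, 5-bit excess-15 exponent, 10 mantissa bits) can be decoded by hand. The following Python sketch is an independent illustration of that layout, not the libcwf conversion routine; the function name half_to_float is an assumption.

```python
import struct


def half_to_float(bits):
    """Convert a 16-bit half-precision bit pattern to a Python float."""
    sign = -1.0 if bits & 0x8000 else 1.0
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    if exponent == 0:
        # Zero or subnormal: no implied leading 1 bit.
        return sign * mantissa * 2.0 ** -24
    if exponent == 0x1F:
        # All exponent bits set: infinity or NaN.
        return sign * float("inf") if mantissa == 0 else float("nan")
    # Normal number: implied leading 1, exponent stored with a bias of 15.
    return sign * (1 + mantissa / 1024.0) * 2.0 ** (exponent - 15)


if __name__ == "__main__":
    print(half_to_float(0x3C00))  # bit pattern for 1.0
    print(half_to_float(0x7BFF))  # largest finite value
    # Python's struct module can decode the same format directly ("e").
    print(struct.unpack("<e", struct.pack("<H", 0x7BFF))[0])
```

In practice the "e" format code of the struct module performs the same conversion in one call; the manual version is shown only to make the field layout explicit.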
2.3.1.9
Base Called Data
The total number of bytes consumed by a read is given in the first field (two bytes) of its entry in baseCalledSeq.dat, so this field can serve as an index. For example, the byte offset in the “dna” file for read 100 can be found by summing the first two bytes of the first 99 entries in baseCalledSeq.dat. The 100th entry then gives the number of bytes available for read 100. It is important to note that, since the basecalls are stored in blocks, one must first find the appropriate block and then compute the offset from there. Again, users of the CWF format are encouraged to use libcwf to insulate themselves from errors in extracting the base information.
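The offset arithmetic described above amounts to a running sum. The Python sketch below illustrates it under stated assumptions: the per-read byte counts have already been extracted from the first field of each baseCalledSeq.dat entry, and they all belong to a single block. Locating the block and parsing the entries are not shown, and the function names are hypothetical.

```python
def read_offsets(byte_counts):
    """Return the starting byte offset of every read within one block."""
    offsets = []
    total = 0
    for count in byte_counts:
        offsets.append(total)
        total += count
    return offsets


def locate(byte_counts, read_number):
    """Return (offset, length) for a 1-based read number within one block."""
    if not 1 <= read_number <= len(byte_counts):
        raise IndexError("read number outside this block")
    # Offset is the sum of the byte counts of all preceding reads.
    offset = sum(byte_counts[:read_number - 1])
    return offset, byte_counts[read_number - 1]


if __name__ == "__main__":
    counts = [10, 20, 5]       # made-up byte counts for three reads
    print(locate(counts, 3))   # offset and length of the third read
```

Precomputing the offsets once with read_offsets avoids re-summing the counts for every lookup when many reads are accessed.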
2.3.1.10
Other Public Streams
The possible data types are listed in Table 5, and the current stream types, in Table 6.
Note on Image Formats: The only image format stored in CWF is the lossless Portable Anymap format, specifically the “P5” PGM (Portable Graymap) variant (http://netpbm.sourceforge.net/doc/pgm.html). To save space, only the area encompassed by the region is included in the image. To properly register the image against the original PTP device, you must offset the image slice by the region boundary. This region boundary can be read from the “RevisedBounds” element of the “Region” block in the meta.xml stream. 454 Life Sciences Corporation may introduce other image formats in future variants of the CWF file, so it is important to read the “magic number” and/or file extension of the image to determine the correct image decoder to use.
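To illustrate the advice about checking the magic number, here is a Python sketch that verifies the “P5” signature and reads the PGM header fields before any pixel data is touched. It follows the published PGM layout (magic number, width, height, maximum value, then the raster); the function name parse_pgm_header is an assumption, and registering the image against the region bounds is not shown.

```python
def parse_pgm_header(data):
    """Parse a binary PGM header.

    Returns (width, height, maxval, raster_offset, bytes_per_sample).
    """
    if data[:2] != b"P5":
        raise ValueError("not a binary PGM (P5) image")
    pos = 2
    values = []
    while len(values) < 3:
        # Skip whitespace and '#' comments between header tokens.
        while True:
            if pos >= len(data):
                raise ValueError("truncated PGM header")
            char = data[pos:pos + 1]
            if char.isspace():
                pos += 1
            elif char == b"#":
                while pos < len(data) and data[pos:pos + 1] not in (b"\n", b"\r"):
                    pos += 1
            else:
                break
        start = pos
        while pos < len(data) and data[pos:pos + 1].isdigit():
            pos += 1
        if start == pos:
            raise ValueError("malformed PGM header")
        values.append(int(data[start:pos]))
    width, height, maxval = values
    # A single whitespace byte separates the header from the raster.
    # Samples use 2 bytes (most significant byte first) when maxval > 255.
    return width, height, maxval, pos + 1, (1 if maxval < 256 else 2)
```

Rejecting anything that does not start with “P5” leaves room to dispatch to a different decoder if other image formats appear in later variants of the file.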
2.3.1.11
Other Private Streams
2.3.1.12
Other Files