QC Analysis Report
| Contract Information | Contract Content |
|---|---|
| {{item}} | {% endfor %}
A. Library Preparation and Sequencing
For pre-made Ultima libraries, Library QC is conducted prior to loading. If conversion is required, non-Ultima libraries must be stored separately in individual tubes and undergo a pre-conversion check. Qualified libraries will then have their adapters replaced with Ultima adapters to ensure compatibility with the UG100 system.
From sample input to final data, each step (sample processing, library preparation, and sequencing) can affect data quality. To ensure reliability, quality control is implemented throughout the workflow.
1 Library Quality Control and Sequencing
Library quality control is then performed to confirm the loading concentration for sequencing on the Ultima UG100 platform. Once sequencing is complete, the data undergoes QC to generate a data QC report for settlement.
B. Results and Instructions
1 Data Quality Control
1.1 Distribution of Sequencing Quality
Here e is the sequencing error rate and Qphred is the base quality value, related by Qphred = -10 log10(e). The correspondence between the sequencing error rate (e) and the base quality value (Qphred) is shown in the table below:
| Phred score | Error rate | Base call accuracy | Q-score |
|---|---|---|---|
| 10 | 1/10 | 90% | Q10 |
| 20 | 1/100 | 99% | Q20 |
| 30 | 1/1000 | 99.9% | Q30 |
| 40 | 1/10000 | 99.99% | Q40 |
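The conversion between the two quantities can be sketched in a few lines of Python. This is a minimal illustration of the formula Qphred = -10 log10(e) and its inverse; the function names are hypothetical and are not part of any delivered pipeline.

```python
import math


def phred_to_error(q: float) -> float:
    """Error rate e for a Phred quality value Q: e = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(e: float) -> float:
    """Phred quality value for an error rate e: Q = -10 * log10(e)."""
    return -10 * math.log10(e)


# Print the error rate and base call accuracy for the Q-scores in the table.
for q in (10, 20, 30, 40):
    e = phred_to_error(q)
    print(f"Q{q}: error rate {e:g}, accuracy {100 * (1 - e):g}%")
```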
The distribution of quality score is shown in Fig.1:
Fig.1 Distribution of Sequencing Quality
The base position is on the horizontal axis and the sequencing quality is on the vertical axis.
1.2 Distribution of Sequencing Error Rate
The error rate of this project is shown in Fig.2:
Fig.2 Error Rate Distribution
The base position is on the horizontal axis and the single-base error rate is on the vertical axis.
1.3 Distribution of A/T/G/C Base
This plot is used to assess the base composition and to detect any separation between AT and GC. According to the principle of complementary base pairing, the proportions of A and T, and of G and C, should be equal at each sequencing cycle and remain constant throughout the sequencing process.
The distribution of GC content is shown in Fig.3:
Fig.3 A/T/G/C Distribution
The base position is on the horizontal axis and the single-base percentage is on the vertical axis.
1.4 Results of Raw Data Filtering
Sequenced reads (raw reads) often contain low-quality reads and adapter sequences, which can impact analysis quality. Therefore, filtering is necessary to obtain clean reads. The filtering process includes:
(1) Remove reads in which more than 10% of the bases are N (N denotes a base that could not be determined).
(2) Remove reads in which low-quality bases (Qscore <= 5) make up more than 50% of the read.
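The two filtering criteria above can be sketched as follows. This is a hedged illustration, not the production filter: it assumes Phred+33 quality encoding (the Sanger FASTQ convention) and checks the N criterion before the low-quality criterion. The function names and the order of the checks are assumptions.

```python
def phred33_scores(quality: str) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality]


def classify_read(seq: str, qual: str,
                  max_n_frac: float = 0.10,
                  low_q: int = 5,
                  max_low_frac: float = 0.50) -> str:
    """Return 'containing_n', 'low_quality', or 'clean' for one read."""
    if not seq or len(seq) != len(qual):
        raise ValueError("sequence and quality must be non-empty and equal in length")
    # Criterion (1): more than 10% of the bases are N.
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return "containing_n"
    # Criterion (2): bases with Qscore <= 5 make up more than 50% of the read.
    low = sum(1 for q in phred33_scores(qual) if q <= low_q)
    if low / len(seq) > max_low_frac:
        return "low_quality"
    return "clean"


print(classify_read("ACGTACGTAC", "IIIIIIIIII"))
```

Reads labeled 'clean' correspond to the clean reads retained for delivery; the other two labels correspond to the removed categories.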
The filtering results for this project are shown in Fig.4:
Fig.4 Composition of Raw Data
Each color represents a different component:
(1) Containing N: (reads with more than 10% N) / (total raw reads)
(2) Low quality: (reads of low quality) / (total raw reads)
(3) Clean reads: (clean reads) / (total raw reads)
2 Summary of Sequencing Data Information
Total sequencer output: {{total_raw_readsnumber}} M raw reads; {{total_raw}} G raw data.
Detailed statistics on the quality of the sequencing data are shown in Table 1.
Table 1 Data Quality Summary
| {{item}} | {% endfor %}||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | {{x.9}} | {{x.10}} | {{x.11}} | {{x.12}} |
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | {{x.9}} | {{x.10}} | {{x.11}} | |
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | {{x.9}} | {{x.10}} | ||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | {{x.9}} | {{x.10}} | ||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | {{x.9}} | |||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | ||||
| {{item}} | {% endfor %}||||||||||||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | {{x.9}} | {{x.10}} | ||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | {{x.9}} | {{x.10}} | ||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | {{x.9}} | |||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | {{x.9}} | |||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | ||||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | {{x.9}} | |||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | {{x.9}} | |||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | ||||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} | {{x.8}} | ||||
| {{x.0}} | {{x.1}} | {{x.2}} | {{x.3}} | {{x.4}} | {{x.5}} | {{x.6}} | {{x.7}} |
| Q1: Why does "Lane Raw Base (G)" sometimes not equal the sum of "Raw Base (G)" for all samples plus "Raw Base (G)" for the Undetermined data? |
| A1: This happens when index lengths vary within the lane or when certain sample indexes are provided more than once. Such cases require several rounds of re-demultiplexing from the BCL level, so the undetermined data contains sample data from other demultiplexing batches, hence the discrepancy. |
| Q2: What is undetermined data and why does it occur? |
| A2: In high-throughput sequencing, reads are identified and allocated based on their associated index/barcode. Undetermined data occurs when the sequencer reads an index/barcode sequence that does not match any of the user-provided sequences. Potential causes include: 1. Errors in the provided index/barcode, or missing index/barcode information for some samples. 2. Base imbalance in the index sequences within mixed libraries, which makes it difficult for the sequencer to determine the index bases accurately. 3. Incomplete adapter ligation during library preparation. 4. Cross-contamination among samples during library preparation. 5. Sequencing errors in index reads, since no sequencer achieves 100% base calling accuracy. 6. PhiX spike-in: if PhiX was added during sequencing, it will be detected as undetermined data. |
| Q3: What is the use of undetermined data? |
| A3: For samples with incorrect or missing index/barcode information, it is possible to attempt further sorting of reads with specific indexes from the undetermined data. This can recover data that would otherwise remain unassigned. |
C. Appendix
1 Introduction of Sequencing Data Format
Base calling converts the original raw data from the UG100 platform into sequenced reads, known as raw data or raw reads. Raw data are recorded in a FASTQ file, which contains the sequencing reads and their corresponding base qualities. Each read in FASTQ format is stored in four lines, as shown below (Cock P.J.A. et al. 2010):
@V150:418291:NA:NA:1:1:1:277:13:1:868:N:0.756:CAGTTCATCTGTGAT:NA:1379
NAAGAACACGTTCGGTCACCTCAGCACACTTGTGAATGTCATGGGATCCAT
+
#55???BBBBB?BA@DEEFFCFFHHFFCFFHHHHHHHFAE0ECFFD/AEHH
Line 1 begins with a '@' character and is followed by the Ultima Sequence Identifiers and an optional description.
| Identifier | Meaning |
|---|---|
| V150 | Instrument ID |
| 418291 | Run ID |
| NA | reserved (NA) |
| NA | reserved (NA) |
| 1 | Camera |
| 1 | Ring |
| 1 | Tile |
| 277 | X pos |
| 13 | Y pos |
| 1 | Segment num |
| 868 | First flow signal |
| N | Filtered (Y/N) |
| 0.756 | RSQ (Read quality) |
| CAGTTCATCTGTGAT | Barcode |
| NA | UMI |
| 1379 | Bead index |
Line 2 is the raw sequence of the read.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifiers and descriptions as in Line 1.
Line 4 encodes the quality values for the bases in Line 2 and contains the same number of characters as there are bases in the read (Cock P.J.A. et al. 2010).
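The four-line layout and the colon-separated identifier described above can be parsed with a short script. This is a sketch under stated assumptions: the field names are hypothetical labels derived from the identifier table in this report, and the parser assumes exactly 16 colon-separated fields as in the example read.

```python
# Hypothetical field names, taken in order from the identifier table above.
ULTIMA_FIELDS = [
    "instrument_id", "run_id", "reserved_1", "reserved_2", "camera", "ring",
    "tile", "x_pos", "y_pos", "segment_num", "first_flow_signal", "filtered",
    "rsq", "barcode", "umi", "bead_index",
]


def parse_header(line: str) -> dict[str, str]:
    """Split a FASTQ Line 1 into named identifier fields."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    parts = line[1:].split(maxsplit=1)  # drop the optional description
    if not parts:
        raise ValueError("empty FASTQ header")
    values = parts[0].split(":")
    if len(values) != len(ULTIMA_FIELDS):
        raise ValueError(f"expected {len(ULTIMA_FIELDS)} fields, got {len(values)}")
    return dict(zip(ULTIMA_FIELDS, values))


def parse_record(lines: list[str]) -> dict:
    """Parse one four-line FASTQ record and check its internal consistency."""
    header, seq, plus, qual = (text.rstrip("\n") for text in lines)
    if not plus.startswith("+"):
        raise ValueError("Line 3 must start with '+'")
    if len(seq) != len(qual):
        raise ValueError("Line 2 and Line 4 must have the same length")
    return {"id": parse_header(header), "seq": seq, "qual": qual}


record = parse_record([
    "@V150:418291:NA:NA:1:1:1:277:13:1:868:N:0.756:CAGTTCATCTGTGAT:NA:1379",
    "NAAGAACACGTTCGGTCACCTCAGCACACTTGTGAATGTCATGGGATCCAT",
    "+",
    "#55???BBBBB?BA@DEEFFCFFHHFFCFFHHHHHHHFAE0ECFFD/AEHH",
])
print(record["id"]["barcode"], len(record["seq"]))
```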
2 Notes on Sequencing Data
(1) Q-Score Comparison
It is not recommended to directly compare Q-scores between the Ultima sequencing platform and other sequencing platforms (e.g., Illumina) due to differences in how sequencing error rates are calculated. The Ultima platform determines error rates primarily based on base length accuracy, whereas the Illumina platform relies on fluorescence color, intensity, and background noise.
(2) Data Integrity Check
The sequencing data is provided as a compressed file in the '.fq.gz' format. Before data delivery, we calculate the MD5 checksum for each compressed file, which should be verified upon receipt.
In a Linux environment, use the command: md5sum -c <*md5.txt>.
In a Windows environment, a checksum verification tool (e.g., HashMyFiles) can be used. If the MD5 value of the compressed file does not match the one provided in the MD5 file, the file may have been corrupted during transmission.
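On systems without md5sum, the same check can be sketched in Python using only the standard library. The sketch assumes that the MD5 file uses the md5sum layout (one line per file: checksum, whitespace, file name) and that the file names are relative to the MD5 file's directory. Both are assumptions about the delivery layout, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path) -> dict[str, bool]:
    """Return {file name: checksum matches} for an md5sum-style manifest."""
    manifest = Path(manifest_path)
    results = {}
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
        results[name] = md5_of_file(manifest.parent / name) == expected.lower()
    return results
```

A file whose entry maps to False should be downloaded again before any analysis is run on it.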
(3) Single-Cell Sequencing Data on UG100
For single-cell sequencing data generated on the UG100 platform, each sample consists of two data files: a Read 1 file and a Read 2 file. These files contain the same number of lines. In a Linux environment, you can verify this using the command: wc -l
(4) Data Size and Storage
The data size refers to the storage space occupied on the hard disk, which depends on the disk format and compression ratio. It does not affect the total number of sequenced bases. As a result, the file sizes of Read 1 and Read 2 may not be identical.
(5) Clean Data Delivery
We will apply strict filtering standards to ensure high-quality data suitable for further research and publication. The data filtered using this standard has been recognized in high-impact publications (e.g., Yan L.Y. et al., 2013). For more details, please contact us.
(6) Read Processing
The Read 1 and Read 2 sequences are assigned based on the index read. Since both reads represent sample sequences, there is no need to trim the beginning or end of the reads during downstream analysis (e.g., mapping).
(7) Data Retention Policy
Outdated data will be deleted 30 days after data delivery. Please ensure that you store your data properly. If you have any questions or concerns, contact us as soon as possible.
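For note (3), the Read 1 and Read 2 line counts can also be compared without decompressing the files to disk. The sketch below assumes gzip-compressed FASTQ files ('.fq.gz'); the function names are hypothetical.

```python
import gzip


def count_fastq_lines(path) -> int:
    """Count the lines in a gzip-compressed FASTQ file."""
    with gzip.open(path, "rb") as handle:
        return sum(1 for _ in handle)


def reads_match(read1_path, read2_path) -> bool:
    """Check that Read 1 and Read 2 files hold the same number of lines."""
    n1 = count_fastq_lines(read1_path)
    n2 = count_fastq_lines(read2_path)
    # Every FASTQ record spans four lines, so any other count suggests truncation.
    if n1 % 4 or n2 % 4:
        raise ValueError("line count is not a multiple of 4; a file may be truncated")
    return n1 == n2
```

Dividing either line count by four gives the number of reads in that file.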
Cock P.J.A. et al. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 38, 1767-1771.
3 Result File Decompression Method
| Compressed format | User type | Decompression method |
|---|---|---|
| *.tar | Unix/Linux/Mac | tar -xvf *.tar |
| *.tar | Windows | Decompression software such as WinRAR or 7-Zip |
| *.gz | Unix/Linux/Mac | gzip -d *.gz |
| *.gz | Windows | Decompression software such as WinRAR or 7-Zip |
| *.zip | Unix/Linux/Mac | unzip *.zip |
| *.zip | Windows | Decompression software such as WinRAR or 7-Zip |
4 References
Hansen K.D. et al. (2010). Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research 38, e131-e131.
Erlich Y. et al. (2008). Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nature Methods 5, 679-682.
Jiang L.C. et al. (2011). Synthetic spike-in standards for RNA-seq experiments. Genome Research 21, 1543-1551.
Yan L.Y. et al. (2013). Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct Mol Biol.


Sample: sample name
Library_Flowcell_Lane: Library ID_Flowcell ID_lane ID, for raw data file naming.
Raw reads: total number of reads in the raw data, where every four lines of the FASTQ file count as one read. For paired-end sequencing, it is the combined number of Read 1 and Read 2 reads; for single-end sequencing, it is the number of Read 1 reads.
Raw data: (Raw reads) * (sequence length), expressed in G. For paired-end sequencing such as PE150, the sequence length is 150; for single-end sequencing such as SE50, it is 50.
Effective: (Clean reads/Raw reads)*100%
Error: base error rate
{% if 10Xread2 %} Q20, Q30 of read2: (Base count of Phred value > 20 or 30) / (Total base count)
{% else %} Q20, Q30: (Base count of Phred value > 20 or 30) / (Total base count)
{% endif %} GC: (G & C base count) / (Total base count)
{% if ifnreads %} N reads:Number of reads with N
{% endif %} {% if pml %} Flowcell Lane: The lane ID in Flowcell
Undetermined: For full lane sequencing projects, the undetermined data will be displayed. Undetermined data refers to the reads that could not be assigned to any specific sample during the demultiplexing process. This can happen when the index sequences provided for some samples are incorrect or when no index sequences are provided. These undetermined reads are typically sorted into an "Undetermined" fastq file. To salvage reads from these undetermined files, one can attempt to further demultiplex the reads based on specific index sequences. {% endif %}
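The summary metrics defined above can be reproduced from a set of reads with a short script. This is a hedged sketch rather than the reporting pipeline itself: it assumes Phred+33 quality encoding, takes 1 G to mean 10^9 bases, and uses the strict "greater than" comparison as written in the Q20/Q30 definitions. The function names are hypothetical.

```python
def effective_rate(clean_reads: int, raw_reads: int) -> float:
    """Effective: (Clean reads / Raw reads) * 100%."""
    return 100.0 * clean_reads / raw_reads


def raw_data_g(raw_reads: int, sequence_length: int) -> float:
    """Raw data in G, assuming 1 G = 1e9 bases."""
    return raw_reads * sequence_length / 1e9


def base_metrics(reads: list[tuple[str, str]]) -> dict[str, float]:
    """Compute Q20, Q30, and GC percentages over (sequence, quality) pairs."""
    total = q20 = q30 = gc = 0
    for seq, qual in reads:
        total += len(seq)
        gc += sum(1 for base in seq.upper() if base in "GC")
        scores = [ord(char) - 33 for char in qual]  # assumed Phred+33
        q20 += sum(1 for q in scores if q > 20)
        q30 += sum(1 for q in scores if q > 30)
    if total == 0:
        raise ValueError("no bases to summarize")
    return {
        "Q20": 100.0 * q20 / total,
        "Q30": 100.0 * q30 / total,
        "GC": 100.0 * gc / total,
    }


example = [("GGCC", "IIII"), ("AATT", "5555")]
print(base_metrics(example), effective_rate(90, 100), raw_data_g(10_000_000, 150))
```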