QC Analysis Report


{% for col in title_project %} {% for item in col %} {% endfor %} {% endfor %}
Contract Information Contract Content
{{item}}


Novogene Co., Ltd



A. Library Preparation and Sequencing

For pre-made Ultima libraries, Library QC is conducted prior to loading. If conversion is required, non-Ultima libraries must be stored separately in individual tubes and undergo a pre-conversion check. Qualified libraries will then have their adapters replaced with Ultima adapters to ensure compatibility with the UG100 system.

From the sample input to the final data, each step—including sample processing, library preparation, and sequencing—can impact data quality. To ensure reliability, quality control is implemented throughout the workflow.

1 Library Quality Control and Sequencing

Library quality control is then performed to confirm the loading concentration for sequencing on the Ultima UG100 platform. Once sequencing is complete, the data undergoes QC to generate a data QC report for settlement.




B. Results and Instructions

1 Data Quality Control

1.1 Distribution of Sequencing Quality

Here, e denotes the sequencing error rate and Qphred denotes the base quality value, where Qphred = -10 × log10(e). The relationship between the sequencing error rate (e) and the base quality value (Qphred) is shown in the table below:

Phred score    Incorrect base calls    Base call accuracy    Q-score
10             1/10                    90%                   Q10
20             1/100                   99%                   Q20
30             1/1000                  99.9%                 Q30
40             1/10000                 99.99%                Q40
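
The conversion between error rate and quality value in the table above can be reproduced with a short Python sketch. The function names are illustrative only and are not part of any Novogene or Ultima software.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability e for a Phred quality value Q: e = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(e: float) -> float:
    """Phred quality value for an error probability e: Q = -10 * log10(e)."""
    if not 0 < e <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(e)


# Print the rows of the table: Q-score, error rate, and base call accuracy.
for q in (10, 20, 30, 40):
    e = phred_to_error(q)
    print(f"Q{q}: error rate {e:g}, accuracy {(1 - e) * 100:g}%")
```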

The distribution of quality scores is shown in Fig.1:

Fig.1 Distribution of Sequencing Quality

The base position is on the horizontal axis and the sequencing quality is on the vertical axis.




1.2 Distribution of Sequencing Error Rate

The error rate of this project is shown in Fig.2:


Fig.2 Error Rate Distribution

The base position is on the horizontal axis and the single base error rate is on the vertical axis.




1.3 Distribution of A/T/G/C Base

This check assesses the base composition at each sequencing cycle in order to detect any separation between A and T or between G and C. According to the principle of complementary base pairing, the proportions of A and T should be equal, and the proportions of G and C should be equal, at each sequencing cycle, and they should remain stable throughout the sequencing process.

The base distribution of this project is shown in Fig.3:

Fig.3 A/T/G/C Distribution

The base position is on the horizontal axis and the single base percentage is on the vertical axis.
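
As an illustration of what this plot summarizes, the following Python sketch computes the percentage of each base at every read position. It is a simplified stand-in, not the pipeline used to produce Fig.3; the function name and the toy reads are hypothetical.

```python
from collections import Counter


def base_composition(reads):
    """Return, for each read position, the percentage of A/T/G/C/N bases."""
    counts = []  # counts[i] tallies the bases observed at position i
    for read in reads:
        for pos, base in enumerate(read.upper()):
            if pos == len(counts):
                counts.append(Counter())
            counts[pos][base] += 1
    result = []
    for counter in counts:
        total = sum(counter.values())
        result.append({b: 100.0 * counter[b] / total for b in "ATGCN"})
    return result


# Toy input: reads may differ in length, so later positions have fewer bases.
for pos, pct in enumerate(base_composition(["ACGT", "ACGA", "TCG"]), start=1):
    print(pos, pct)
```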




1.4 Results of Raw Data Filtering

Sequenced reads (raw reads) often contain low-quality reads and adapter sequences, which can impact analysis quality. Therefore, filtering is necessary to obtain clean reads. The filtering process includes:

(1) Remove reads in which more than 10% of the bases are N (N denotes a base that could not be determined).

(2) Remove reads in which low-quality bases (Q-score <= 5) make up more than 50% of all bases.
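
The two filtering rules can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the production filter: the function name and category labels are hypothetical, quality values are assumed to be already decoded to integers, the N rule is checked before the quality rule, and empty reads are treated as low quality.

```python
def classify_read(seq, quals, max_n_frac=0.10, low_q=5, max_low_frac=0.50):
    """Classify one read as 'containing_N', 'low_quality', or 'clean'.

    seq   -- base sequence as a string
    quals -- list of integer quality values, one per base
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    if not seq:
        return "low_quality"  # assumption: an empty read is never clean
    n_frac = seq.upper().count("N") / len(seq)
    if n_frac > max_n_frac:
        return "containing_N"  # rule (1): more than 10% N
    low_frac = sum(q <= low_q for q in quals) / len(quals)
    if low_frac > max_low_frac:
        return "low_quality"  # rule (2): more than 50% bases with Q <= 5
    return "clean"


print(classify_read("NNACGTACGT", [30] * 10))            # 20% N
print(classify_read("ACGTACGTAC", [2] * 6 + [30] * 4))   # 60% low-quality bases
print(classify_read("ACGTACGTAC", [30] * 10))
```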

The results of sequencing data filtering for this project are shown in Fig.4:

Fig.4 Composition of Raw Data

Each color represents a different component:

(1) Containing N: (reads with more than 10% N) / (total raw reads)

(2) Low quality: (reads of low quality) / (total raw reads)

(3) Clean reads: (clean reads) / (total raw reads)




2 Summary of Sequencing Data Information

The total data output from the sequencer (raw data): {{total_raw_readsnumber}} M raw reads; {{total_raw}} G.

Detailed quality statistics for the sequencing data are shown in Table 1.

Table 1 Data Quality Summary

{% if pml %} {% for item in table_qc_pml_head %}{% endfor %} {% for each in table_qc_pml %} {% for x in each %} {% if forloop.first %} {% if reptype == 'qc' %} {% if ifnreads %} {% else %} {% endif %} {% else %} {% endif %} {% else %} {% if reptype == 'qc' %} {% if ifnreads %} {% else %} {% endif %} {% else %} {% endif %} {% endif %} {% endfor %} {% endfor %} {% else %} {% for item in table_qc_head %}{% endfor %} {% for each in table_qc %} {% for x in each %} {% if forloop.first %} {% if 'DHE' in lib_type and reptype == 'qc' %} {% elif ifnreads %} {% if reptype == 'qc' %} {% else %} {% endif %} {% else %} {% if reptype == 'qc' %} {% else %} {% endif %} {% endif %} {% else %} {% if 'DHE' in lib_type and reptype == 'qc' %} {% elif ifnreads %} {% if reptype == 'qc' %} {% else %} {% endif %} {% else %} {% if reptype == 'qc' %} {% else %} {% endif %} {% endif %} {% endif %} {% endfor %} {% endfor %} {% endif %}
{{item}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}{{x.9}}{{x.10}}{{x.11}}{{x.12}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}{{x.9}}{{x.10}}{{x.11}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}{{x.9}}{{x.10}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}{{x.9}}{{x.10}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}{{x.9}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}
{{item}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}{{x.9}}{{x.10}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}{{x.9}}{{x.10}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}{{x.9}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}{{x.9}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}{{x.9}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}{{x.9}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}{{x.8}}
{{x.0}}{{x.1}}{{x.2}}{{x.3}}{{x.4}}{{x.5}}{{x.6}}{{x.7}}

{% if pml %}

Q1: Why does "Lane Raw Base (G)" not equal the sum of "Raw Base (G)" for all samples plus "Raw Base (G)" for the Undetermined data?
A1: This happens when index lengths vary within the lane or when certain sample indexes are provided more than once. Such cases require several rounds of re-demultiplexing from the BCL level, so the undetermined data contains sample data from other demultiplexing batches, hence the discrepancy.
Q2: What is undetermined data, and why does it occur?
A2: In high-throughput sequencing, reads are identified and allocated based on their associated index/barcode. Undetermined data occurs when the sequencer reads an index/barcode sequence that does not match any of the user-provided sequences.
Potential reasons for the generation of undetermined data include:
1. Errors in the provided index/barcode, or the absence of an index/barcode for some samples.
2. Base imbalance in index sequences within mixed libraries, making it difficult for the sequencer to accurately determine the index base sequences.
3. Incomplete adapter ligation during library preparation.
4. Cross-contamination among samples during library preparation.
5. Sequencing errors in index reads, as no sequencer achieves 100% accuracy in base calling.
6. PhiX spike-in: if PhiX was spiked in during sequencing, it will be detected as undetermined data.
Q3: What is the use of undetermined data?
A3: For samples with incorrect or missing index/barcode information, reads carrying specific indexes can be sorted out of the undetermined data. This can recover data that would otherwise remain unassigned, improving the overall yield of the sequencing run.
{% endif %}




C. Appendix

1 Introduction of Sequencing Data Format

The original raw data from the UG100 platform are converted by base calling into sequenced reads, known as raw data or raw reads. Raw data are recorded in a FASTQ file, which contains the sequencing reads and their corresponding sequencing quality. Every read in FASTQ format is stored in four lines, as shown below (Cock P.J.A. et al., 2010):

@V150:418291:NA:NA:1:1:1:277:13:1:868:N:0.756:CAGTTCATCTGTGAT:NA:1379   
NAAGAACACGTTCGGTCACCTCAGCACACTTGTGAATGTCATGGGATCCAT
+
#55???BBBBB?BA@DEEFFCFFHHFFCFFHHHHHHHFAE0ECFFD/AEHH

Line 1 begins with a '@' character and is followed by the Ultima Sequence Identifiers and an optional description.

Identifier         Meaning
V150               Instrument ID
418291             Run ID
NA                 reserved (NA)
NA                 reserved (NA)
1                  Camera
1                  Ring
1                  Tile
277                X pos
13                 Y pos
1                  Segment num
868                First flow signal
N                  Filtered (Y/N)
0.756              RSQ (Read quality)
CAGTTCATCTGTGAT    Barcode
NA                 UMI
1379               Bead index
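
The identifier fields listed above can be split apart with a short Python sketch. The field names are illustrative labels derived from the table, not an official Ultima schema, and the sketch assumes that any optional description is separated from the identifier by whitespace. All values are kept as strings.

```python
# Illustrative field names, in the order given in the table above.
FIELDS = (
    "instrument_id", "run_id", "reserved_1", "reserved_2",
    "camera", "ring", "tile", "x_pos", "y_pos", "segment_num",
    "first_flow_signal", "filtered", "rsq", "barcode", "umi", "bead_index",
)


def parse_header(line):
    """Split Line 1 of a FASTQ record into a dict of identifier fields."""
    tokens = line.strip().split()
    if not tokens or not tokens[0].startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    parts = tokens[0][1:].split(":")  # drop '@', ignore any description
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


header = "@V150:418291:NA:NA:1:1:1:277:13:1:868:N:0.756:CAGTTCATCTGTGAT:NA:1379"
for name, value in parse_header(header).items():
    print(f"{name}: {value}")
```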

Line 2 is the raw sequence of the read.

Line 3 begins with a '+' character and is optionally followed by the same sequence identifiers and descriptions as in Line 1.

Line 4 encodes the quality values for the bases in Line 2 and contains the same number of characters as there are bases in the read (Cock P.J.A. et al., 2010).
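
The four-line record structure can be read with a small Python sketch such as the one below. It assumes the Sanger quality encoding described by Cock et al. (2010), in which each quality character is the Phred value plus 33; this offset is an assumption here, so confirm it for your data. The function name is hypothetical.

```python
import io


def read_fastq(handle, offset=33):
    """Yield (identifier, sequence, quality values) for each four-line record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Assumed Sanger encoding: quality value = ASCII code - offset.
        yield header[1:].strip(), seq, [ord(c) - offset for c in qual]


example = (
    "@V150:418291:NA:NA:1:1:1:277:13:1:868:N:0.756:CAGTTCATCTGTGAT:NA:1379\n"
    "NAAGAACACGTTCGGTCACCTCAGCACACTTGTGAATGTCATGGGATCCAT\n"
    "+\n"
    "#55???BBBBB?BA@DEEFFCFFHHFFCFFHHHHHHHFAE0ECFFD/AEHH\n"
)
for name, seq, quals in read_fastq(io.StringIO(example)):
    print(name, len(seq), min(quals), max(quals))
```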


2 Notes on Sequencing Data

(1) Q-Score Comparison: Directly comparing Q-scores between the Ultima sequencing platform and other sequencing platforms (e.g., Illumina) is not recommended, because the platforms calculate sequencing error rates differently. The Ultima platform determines error rates primarily based on base length accuracy, whereas the Illumina platform relies on fluorescence color, intensity, and background noise.

(2) Data Integrity Check: The sequencing data are provided as compressed files in '.fq.gz' format. Before data delivery, we calculate the MD5 checksum of each compressed file; these checksums should be verified upon receipt. In a Linux environment, use the command: md5sum -c <*md5.txt>. In a Windows environment, a checksum verification tool (e.g., HashMyFiles) can be used. If the MD5 value of a compressed file does not match the one provided in the MD5 file, the file may have been corrupted during transmission.
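
Where neither md5sum nor a graphical tool is available, the same check can be sketched in Python with the standard hashlib module. This sketch assumes the MD5 file uses the usual md5sum layout (checksum, whitespace, file name on each line) and that the listed files sit next to the MD5 file; both are assumptions about the delivery layout, and the function names are hypothetical.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_list(md5_file):
    """Return a dict mapping each listed file name to True (match) or False."""
    # Assumption: listed paths are relative to the MD5 file's directory.
    base = os.path.dirname(os.path.abspath(md5_file))
    results = dict()
    with open(md5_file) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            expected, name = line.split(None, 1)
            name = name.lstrip("*")  # md5sum marks binary mode with '*'
            actual = md5_of_file(os.path.join(base, name))
            results[name] = actual == expected.lower()
    return results
```

A file reported as False should be downloaded again before any analysis is run on it.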

(3) Single-Cell Sequencing Data on UG100: For single-cell sequencing data generated on the UG100 platform, each sample consists of two data files: a Read 1 file and a Read 2 file. These files contain the same number of lines. In a Linux environment, you can verify this using the command: wc -l . The total line count divided by 4 gives the number of reads.
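
The line-count check can also be sketched in Python, which reads the gzip-compressed files directly. The function names are hypothetical, and the file paths passed in are whatever names your delivery uses.

```python
import gzip


def count_reads(path):
    """Count reads in a FASTQ file (plain or .gz): line count divided by 4."""
    opener = gzip.open if path.endswith(".gz") else open
    lines = 0
    with opener(path, "rt") as fh:
        for _ in fh:
            lines += 1
    if lines % 4 != 0:
        raise ValueError(f"{path}: line count {lines} is not a multiple of 4")
    return lines // 4


def same_read_count(read1_path, read2_path):
    """Return True if the Read 1 and Read 2 files hold the same number of reads."""
    return count_reads(read1_path) == count_reads(read2_path)
```

A mismatch between the two files, or a line count that is not a multiple of 4, usually points to a truncated file and is worth checking against the MD5 values.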

(4) Data Size and Storage: The data size refers to the storage space occupied on the hard disk, which depends on the disk format and compression ratio. It does not affect the total number of sequenced bases. As a result, the file sizes of Read 1 and Read 2 may not be identical.

(5) Clean Data Delivery: We apply strict filtering standards to ensure high-quality data suitable for further research and publication. Data filtered using this standard have been recognized in high-impact publications (e.g., Yan L.Y. et al., 2013). For more details, please contact us.

(6) Read Processing: The Read 1 and Read 2 sequences are assigned based on the index read. Since both reads represent sample sequences, there is no need to trim the beginning or end of the reads during downstream analysis (e.g., mapping).

(7) Data Retention Policy: Outdated data will be deleted 30 days after data delivery. Please ensure that you store your data properly. If you have any questions or concerns, contact us as soon as possible.

3 Result File Decompression Method

Compressed format    Customer type          Decompression method
*.tar                Unix/Linux/Mac user    use the command: tar -xvf *.tar
*.tar                Windows user           use decompression software such as WinRAR or 7-Zip
*.gz                 Unix/Linux/Mac user    use the command: gzip -d *.gz
*.gz                 Windows user           use decompression software such as WinRAR or 7-Zip
*.zip                Unix/Linux/Mac user    use the command: unzip *.zip
*.zip                Windows user           use decompression software such as WinRAR or 7-Zip
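
On any operating system with Python installed, the three formats can also be unpacked with the standard library, as in the sketch below. The function name is hypothetical. Unlike gzip -d, this sketch keeps the original compressed file, and it should only be used on archives from a trusted source.

```python
import gzip
import os
import shutil
import tarfile
import zipfile


def decompress(path, out_dir="."):
    """Unpack a .tar, .zip, or .gz file into out_dir, chosen by file extension."""
    os.makedirs(out_dir, exist_ok=True)
    if path.endswith(".tar"):
        with tarfile.open(path) as tar:
            tar.extractall(out_dir)
    elif path.endswith(".zip"):
        with zipfile.ZipFile(path) as archive:
            archive.extractall(out_dir)
    elif path.endswith(".gz"):
        # A .gz file holds one compressed file; strip the ".gz" suffix.
        target = os.path.join(out_dir, os.path.basename(path)[:-3])
        with gzip.open(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        raise ValueError(f"unsupported file type: {path}")
```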

4 References

Cock P.J.A. et al. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 38, 1767-1771.

Hansen K.D. et al. (2010). Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research 38, e131.

Erlich Y. et al. (2008). Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nature Methods 5, 679-682.

Jiang L.C. et al. (2011). Synthetic spike-in standards for RNA-seq experiments. Genome Research 21, 1543-1551.

Yan L.Y. et al. (2013). Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct Mol Biol.