Understanding and Interpreting Sequencing Quality Scores

Published:

By Jeremy Weaver

In our field of genomics research, we often encounter sequencing quality scores, also known as Q scores. These scores play a crucial role in analyzing sequencing data and are assigned to each base in a read using a phred-like algorithm. Higher Q scores indicate a smaller probability of error, while lower Q scores can lead to unusable reads and inaccurate conclusions.

Sequencing quality scores are particularly important in next-generation sequencing (NGS) applications, where accuracy is paramount. For example, Q20 represents an error rate of 1 in 100, or a call accuracy of 99%. Q30 corresponds to an error rate of 1 in 1,000 (99.9% accuracy) and is widely treated as the benchmark for quality in NGS.

Understanding and interpreting sequencing quality scores is essential for obtaining reliable and accurate results in NGS applications. The same Phred scale is also used to score base calls in Sanger sequencing. These scores serve as a benchmark for quality and guide us in making informed decisions during data analysis.

What is a Quality Score in Sequencing?

In sequencing, a quality score represents the estimated probability of a base call being incorrect. It is a crucial metric that assesses the accuracy of sequencing data. The quality score is assigned to each base in a read using a phred-like algorithm. This algorithm calculates the Q score as Q = -10 log10(e), where e is the estimated probability of an incorrect base call. For example, e = 0.001 gives Q = 30. Higher quality scores indicate a smaller probability of error and greater base call accuracy.

Quality scores are particularly important in next-generation sequencing (NGS) technologies, such as SBS (Sequencing by Synthesis). These scores help researchers determine the reliability and accuracy of sequencing data for various applications, including genomic research, clinical diagnostics, and personalized medicine.

By understanding the definition and significance of quality scores in sequencing, researchers can make informed decisions in data analysis and ensure the reliability of their results.

Table: Quality Score Examples

| Quality Score (Q) | Error Probability | Base Call Accuracy |
|---|---|---|
| Q20 | 1 in 100 | 99% |
| Q30 | 1 in 1,000 | 99.9% |
| Q40 | 1 in 10,000 | 99.99% |
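
The relationship between Q and e in the table above can be checked with a few lines of Python. This is a minimal sketch of the formula Q = -10 log10(e) and its inverse; the function names are chosen here for illustration and do not come from any particular library.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to the estimated error probability e."""
    return 10 ** (-q / 10)

def error_prob_to_phred(e: float) -> float:
    """Convert an error probability e (0 < e <= 1) to Q = -10 * log10(e)."""
    if not 0 < e <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(e)

# Print error probability and accuracy for the scores in the table.
for q in (20, 30, 40):
    e = phred_to_error_prob(q)
    print(f"Q{q}: error probability {e:g}, accuracy {100 * (1 - e):g}%")
```

Because the scale is logarithmic, every increase of 10 in Q divides the error probability by 10.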

Importance of Sequencing Quality Scores

Sequencing quality scores play a crucial role in determining the usability and accuracy of sequencing reads. The quality scores assigned to each base in a read provide valuable information about the probability of an incorrect base call. Lower quality scores can result in a significant portion of the reads being unusable, as they may contain errors or ambiguities. These unusable reads can have a detrimental impact on downstream analyses, leading to false-positive variant calls and inaccurate conclusions.

One of the benchmark standards for quality in next-generation sequencing (NGS) is a quality score of Q30, which corresponds to a 1 in 1,000 chance that any given base call is wrong. When the bases in a run reach Q30, most reads contain no errors at all. This level of accuracy is particularly important in applications such as clinical research, where reliable results are essential for making informed decisions regarding patient care and treatment plans. By understanding and considering sequencing quality scores, researchers can improve the reliability and accuracy of their sequencing data, ensuring meaningful and valid conclusions.

Impact on Unusable Reads and False-Positive Variant Calls

Sequencing quality scores serve as a measure of the confidence we can place in each base call within a read. When quality scores are low, a significant number of reads may be labeled as unusable due to the higher probability of errors or ambiguities. These unusable reads can negatively impact downstream analyses, limiting the amount of reliable data available for interpretation.

Additionally, low-quality scores can lead to an increased number of false-positive variant calls. False-positive variant calls can occur when low-quality bases are misinterpreted as genetic variants, potentially skewing the analysis and resulting in inaccurate conclusions. By prioritizing high-quality reads and understanding the impact of low-quality scores on variant calling, researchers can minimize the risk of false-positive results and ensure the accuracy of their findings.

| Sequencing Quality Score | Probability of Error | Usability of Reads | False-Positive Variant Calls |
|---|---|---|---|
| Q30 | 1 in 1,000 | Most reads are error-free and reliable | Minimized risk of false-positive variant calls |
| Q20 | 1 in 100 | More errors than at Q30; some reads may be unusable | Higher risk of false-positive variant calls |
| Below Q20 | Higher than 1 in 100 | Significant portion of reads may be unusable | Increased risk of false-positive variant calls |
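
To see how per-base quality translates into read-level reliability, the sketch below estimates the chance that a whole read is error-free. It makes two simplifying assumptions that are not in the text above: every base in the read has the same Q score, and errors at different positions are independent. The 150-base read length is only an illustrative choice.

```python
def prob_read_error_free(read_length: int, q: float) -> float:
    """Probability that a read has no errors, assuming every base has
    quality q and that errors are independent (a simplification)."""
    p_error = 10 ** (-q / 10)
    return (1 - p_error) ** read_length

def expected_errors(qualities) -> float:
    """Expected number of wrong bases in a read, given its per-base Q scores."""
    return sum(10 ** (-q / 10) for q in qualities)

# Compare a hypothetical 150-base read at uniform Q30 and uniform Q20.
for q in (30, 20):
    print(f"Q{q}: P(error-free) = {prob_read_error_free(150, q):.3f}, "
          f"expected errors = {expected_errors([q] * 150):.2f}")
```

Under these assumptions, the gap between Q30 and Q20 is much larger at the read level than the per-base figures suggest, which is why low-quality bases inflate false-positive variant calls so quickly.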

Understanding Phred Quality Score Encoding

In sequencing data analysis, Phred quality score encoding is a standardized method used to assign a confidence level to each base call. It represents the probability of an incorrect base call at a specific position. The encoding is based on ASCII codes, which allow for easier interpretation and analysis of sequencing data while preserving the same information as the numerical representation.

Phred quality scores are integers that typically range from 0 to about 40. In FASTQ files, each score is stored as a single printable ASCII character rather than as a multi-digit number, which keeps the quality string the same length as the read and reduces file size. In the standard encoding (Phred+33, used by Sanger and by Illumina software from version 1.8 onward), the character is the one whose ASCII code equals the score plus 33. Some older Illumina pipelines used an offset of 64 instead. This encoding scheme is widely used in sequencing data analysis software and tools.

Understanding Phred quality score encoding is crucial for proper interpretation of sequencing data. By examining the quality scores assigned to each base call, researchers can assess the reliability and accuracy of the data. These scores provide valuable insights into the probability of incorrect base calls and allow for filtering and analysis based on base-calling accuracy.

Phred Quality Score Encoding Example

To illustrate the Phred quality score encoding, let’s consider a sequencing read with the following quality scores:


| Position | Quality Score (ASCII) | Quality Score (Numerical) |
|---|---|---|
| 1 | ! | 0 |
| 2 | " | 1 |
| 3 | # | 2 |
| 4 | $ | 3 |
| 5 | % | 4 |

In this example, the ASCII code “!” corresponds to a numerical quality score of 0, while the ASCII code “#” corresponds to a numerical quality score of 2. By decoding the ASCII-encoded quality scores, researchers can gain insights into the overall quality of the sequencing data and make informed decisions regarding data analysis and interpretation.
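
The decoding in this example can be reproduced directly. The sketch below assumes the Phred+33 offset, which is consistent with "!" mapping to 0 in the table; the function names are illustrative.

```python
PHRED_OFFSET = 33  # assumed Phred+33 encoding, where "!" (ASCII 33) means Q0

def decode_quality(qual_string: str, offset: int = PHRED_OFFSET) -> list:
    """Turn an ASCII-encoded quality string into a list of integer Q scores."""
    return [ord(ch) - offset for ch in qual_string]

def encode_quality(scores, offset: int = PHRED_OFFSET) -> str:
    """Turn integer Q scores back into an ASCII-encoded quality string."""
    return "".join(chr(q + offset) for q in scores)

# The five characters from the example table.
print(decode_quality('!"#$%'))  # [0, 1, 2, 3, 4]
```

If a file uses a different offset, decoding with 33 yields implausibly high scores, so checking the range of decoded values is a quick sanity test.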

Interpreting Quality Scores in Data Analysis

When it comes to data analysis in sequencing, understanding and interpreting quality scores is of utmost importance. Quality scores provide us with a measure of confidence in the accuracy of base calls and play a crucial role in assessing the reliability of sequencing data. Many software programs use quality scores as filtering criteria to determine the suitability of bases or entire reads for further analysis. By setting thresholds based on mean quality scores, researchers can ensure the inclusion of high-quality data and filter out reads that may introduce errors or inaccuracies.

Quality scores also play a significant role in SNP analysis, where the accuracy of variant calls depends on the reliability of base calls. Software programs that perform variant detection and SNP calling algorithms utilize quality scores to differentiate between genuine variants and sequencing errors. By considering quality scores as a filtering criterion, researchers can obtain more accurate and reliable results in their analyses.

It is important to note that base-calling accuracy is closely tied to quality scores. Higher quality scores indicate a smaller probability of error in base calls, leading to more accurate and reliable results in data analysis. Understanding the relationship between quality scores and base-calling accuracy is key to making informed decisions and ensuring the validity of sequencing results.
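
A mean-quality filter of the kind described above can be sketched as follows. The reads are represented as hypothetical (name, sequence, quality string) tuples, Phred+33 encoding is assumed, and the threshold of 20 is an arbitrary illustrative value rather than a recommendation.

```python
def mean_quality(qual_string: str, offset: int = 33) -> float:
    """Arithmetic mean of the Q scores in a Phred+33 quality string."""
    if not qual_string:
        return 0.0
    scores = [ord(ch) - offset for ch in qual_string]
    return sum(scores) / len(scores)

def filter_reads(records, min_mean_q: float = 20.0):
    """Keep only (name, sequence, quality) records whose mean Q meets the threshold."""
    return [rec for rec in records if mean_quality(rec[2]) >= min_mean_q]

reads = [
    ("read1", "ACGT", "IIII"),  # four Q40 bases
    ("read2", "ACGT", "I###"),  # one Q40 base followed by three Q2 bases
]
kept = filter_reads(reads)
print([name for name, _seq, _qual in kept])
```

Note that averaging Q scores is not the same as averaging error probabilities, because the scale is logarithmic. A read with a few very poor bases can still have a respectable mean Q, so some tools filter on expected errors instead.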

Example of Software that Uses Quality Scores:

| Software | Description |
|---|---|
| Samtools | A suite of programs for interacting with high-throughput sequencing data. It uses quality scores for variant calling and filtering. |
| GATK | A widely-used toolkit for variant discovery in high-throughput sequencing data. It utilizes quality scores to assess the accuracy and reliability of variant calls. |
| BWA | A fast and accurate alignment tool for mapping sequencing reads to a reference genome. Its bwa aln mode can use quality scores to trim low-quality read ends before alignment. |

In conclusion, interpreting quality scores in data analysis is essential for ensuring the reliability and accuracy of sequencing results. By understanding the confidence in base calls provided by quality scores and using appropriate filtering criteria, researchers can obtain more accurate and meaningful insights from their sequencing data.

Assessing Sequencing Quality with FastQC

Next-generation sequencing (NGS) data quality evaluation is a critical step in ensuring the reliability and accuracy of sequencing results. FastQC is a powerful tool that provides comprehensive insights into the quality metrics of sequencing data. It generates HTML reports that contain valuable information about per base sequence quality, mean quality scores, GC content, and more. These metrics help researchers assess the overall quality of their sequencing data and make informed decisions regarding data analysis and interpretation.

In particular, the per base sequence quality plot in FastQC offers vital information about the distribution of quality scores at each position in the read. By analyzing this plot, researchers can identify any quality issues that may affect the accuracy and reliability of their data. Additionally, other modules in the FastQC report provide insights into sequence content, GC distribution, and the presence of overrepresented sequences. This comprehensive analysis of various quality metrics enables researchers to evaluate the quality of their NGS data thoroughly.

To illustrate the importance of quality assessment, let’s consider two sequencing runs, labeled Run A and Run B. In Run A, the quality scores remain consistently high throughout the reads, indicating excellent data quality. In Run B, there is a noticeable drop in quality scores towards the 3′ end of the reads, which could be attributed to sequencing error profiles such as signal decay or phasing. Table 1 shows the per base sequence quality scores for the first three positions only, where the two runs are still indistinguishable; the drop in Run B appears later in the reads. By identifying and understanding these quality issues, researchers can make informed decisions about the suitability of their sequencing data for further analysis.

Table 1: Per base sequence quality scores for Run A and Run B (first three positions)

| Position | Run A Quality Score | Run B Quality Score |
|---|---|---|
| 1 | 40 | 40 |
| 2 | 40 | 40 |
| 3 | 40 | 40 |

Understanding the Per Base Sequence Quality Plot

The per base sequence quality plot in FastQC provides valuable insights into the quality of sequencing reads at each position. It helps identify any issues or anomalies that may affect the accuracy and reliability of the data. When analyzing Illumina sequencing data, it is common to observe a drop in quality scores towards the end of the reads. This drop can be attributed to sequencing error profiles, including phasing and signal decay. Phasing refers to a loss of synchronicity in the sequencing signal, while signal decay refers to a decrease in the fluorescent signal intensity. Understanding these sequencing error profiles is essential for interpreting quality scores at different positions in the reads and assessing the overall sequencing data quality.

In the per base sequence quality plot, each position in the read is summarized by a box-and-whisker plot: the central red line is the median quality score, the yellow box spans the inter-quartile range, the whiskers mark the 10th and 90th percentiles, and a blue line traces the mean. The background is shaded green for very good quality, orange for reasonable quality, and red for poor quality. The plot typically shows a gradual decline in quality scores towards the end of the reads due to the aforementioned sequencing error profiles. However, it is important to note that excessive drops or consistently low quality scores throughout the reads may indicate potential issues in the sequencing process.
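
The idea behind the plot is a per-position summary across all reads. The sketch below computes the mean and median Q score at each position from a list of Phred+33 quality strings. It illustrates the concept only and is not FastQC's own implementation.

```python
from statistics import mean, median

def per_base_summary(qual_strings, offset: int = 33):
    """Summarize Q scores at each read position across a set of reads.

    Reads may differ in length; a position is summarized over the reads
    that are long enough to cover it.
    """
    longest = max(len(q) for q in qual_strings)
    summary = []
    for pos in range(longest):
        scores = [ord(q[pos]) - offset for q in qual_strings if pos < len(q)]
        summary.append({
            "position": pos + 1,  # 1-based, as in the plot
            "mean": mean(scores),
            "median": median(scores),
        })
    return summary

# Three short hypothetical reads whose quality falls towards the 3' end.
for row in per_base_summary(["II5", "I5", "II+"]):
    print(row)
```

A mean that sits well below the median at a position points to a minority of reads with very poor calls there, not a uniform decline.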

Example: Distribution of Quality Scores in a Per Base Sequence Quality Plot

| Position | Quality Score |
|---|---|
| 1 | 35 |
| 2 | 36 |
| 3 | 36 |
| 4 | 35 |
| 5 | 33 |
| 6 | 30 |
| 7 | 28 |
| 8 | 25 |
| 9 | 22 |
| 10 | 20 |

In this example, the quality scores fall from the mid-30s at the start of the read to 20 at position 10, indicating a decline in the accuracy of base calls towards the end of the reads. Researchers need to be mindful of such patterns and consider possible implications on downstream analysis and data interpretation.
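
One common response to this pattern is to trim low-quality bases from the 3′ end. The sketch below removes trailing bases whose Q score is under a cutoff, using the ten scores from the example. The cutoff of 28 is an assumption made for illustration, and real trimming tools use more elaborate rules, such as sliding windows.

```python
def trim_3prime(scores, min_q: int = 28) -> int:
    """Return how many bases to keep after dropping trailing bases with Q < min_q."""
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return end

example_scores = [35, 36, 36, 35, 33, 30, 28, 25, 22, 20]
keep = trim_3prime(example_scores)
print(f"keep the first {keep} bases: {example_scores[:keep]}")
```

With this cutoff, positions 8 to 10 are removed and the first seven bases are kept.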

Expected and Worrisome Quality Issues

When analyzing sequencing data, it is important to be aware of the expected quality drops that occur as the sequencing run progresses. Signal decay and phasing are common factors contributing to these drops. Signal decay refers to a decrease in the fluorescent signal intensity with each cycle, leading to quality score drops towards the 3′ end of the reads. Phasing, on the other hand, occurs when some molecules within a cluster fall out of step with the rest, so the signal from the cluster blurs and quality scores decrease.

While expected quality drops are a normal part of the sequencing process, there are also worrisome quality issues that researchers should pay attention to. Overclustering is one such issue, where mixed signals and decreased signal purity can lead to lower quality scores. Additionally, instrumentation breakdown can result in sudden drops in quality or a high percentage of low-quality reads. Identifying and understanding these quality issues are essential for assessing the reliability and accuracy of sequencing data.

Expected Quality Issues:

  • Signal decay leading to quality score drops towards the 3′ end of reads
  • Phasing causing a decrease in quality scores

Worrisome Quality Issues:

  • Overclustering resulting in mixed signals and decreased signal purity
  • Instrumentation breakdown causing sudden drops in quality or high percentage of low-quality reads

| Quality Issue | Description |
|---|---|
| Expected Quality Drops | Drops in quality scores towards the 3′ end of reads due to signal decay and phasing |
| Overclustering | Mixed signals and decreased signal purity leading to lower quality scores |
| Instrumentation Breakdown | Sudden drops in quality or a high percentage of low-quality reads caused by equipment failure |

Assessing Quality Metrics with FastQC Results

When it comes to evaluating the quality of sequencing data, FastQC is a valuable tool that provides comprehensive insights. For each input file, the tool generates an HTML report containing detailed quality metrics and a .zip archive that holds the same results in plain-text form (including summary.txt and fastqc_data.txt) along with the plot images. To access these files, simply unzip the archive and explore its contents. One of the key metrics provided by FastQC is the per sequence quality scores. These scores offer valuable insights into the distribution of average quality scores among the reads, allowing researchers to identify any biases or issues within the data.

In addition to per sequence quality scores, FastQC provides other important metrics that aid in quality assessment. The basic statistics module offers information about the total number of reads, the read length, and the percentage of GC content. These metrics help researchers ensure that the sequencing data meets their expectations and aligns with their experimental design.
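
The contents of the .zip output can also be read programmatically. The sketch below assumes the archive contains a summary.txt file with one tab-separated line per module (status, module name, input file name), which is how FastQC output is commonly laid out. The archive built in the demo is a stand-in created in memory, not real FastQC output.

```python
import io
import zipfile

def read_fastqc_summary(zip_source) -> dict:
    """Return {module name: status} from summary.txt inside a FastQC .zip.

    zip_source may be a path or a file-like object.
    """
    results = {}
    with zipfile.ZipFile(zip_source) as archive:
        # Locate summary.txt wherever it sits inside the archive.
        member = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        with archive.open(member) as handle:
            for line in io.TextIOWrapper(handle, encoding="utf-8"):
                parts = line.rstrip("\r\n").split("\t")
                if len(parts) >= 2:
                    status, module = parts[0], parts[1]
                    results[module] = status
    return results

# Demo with a hypothetical in-memory archive standing in for FastQC output.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr(
        "sample_fastqc/summary.txt",
        "PASS\tBasic Statistics\tsample.fastq\n"
        "WARN\tPer base sequence quality\tsample.fastq\n",
    )
demo.seek(0)
print(read_fastqc_summary(demo))
```

Collecting these statuses across many samples gives a quick overview of which files need a closer look at the full HTML report.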

Interpreting FastQC Plots for Quality Assessment

FastQC plots are essential for conducting a comprehensive quality assessment of the sequencing data. The per base sequence quality plot provides insights into the distribution of quality scores at each position in the read. This plot allows researchers to assess the overall sequencing data quality and identify any regions with lower quality scores that may require additional scrutiny.

Another important plot is the per base sequence content plot, which illustrates the distribution of the four nucleotide bases (A, T, C, G) at each position in the sequence. This plot can help identify biases or anomalies in the data, such as overrepresentation or underrepresentation of certain bases.

The per sequence GC content plot is also informative for quality assessment. It shows the distribution of the GC content across all sequences, allowing researchers to identify potential issues related to the GC content of their data. For example, extreme peaks or valleys in the plot may indicate contamination or biases in library preparation.

| FastQC Plot | Description |
|---|---|
| Per Base Sequence Quality | Distribution of quality scores at each position |
| Per Base Sequence Content | Distribution of nucleotide bases (A, T, C, G) at each position |
| Per Sequence GC Content | Distribution of GC content across all sequences |
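
The quantity behind the per sequence GC content plot is simple to compute. The sketch below calculates the GC percentage of each read and tallies the values into a histogram. The reads and the rounding to whole percentages are illustrative choices, not FastQC's exact procedure.

```python
from collections import Counter

def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)

def gc_histogram(sequences) -> Counter:
    """Count reads by GC percentage, rounded to the nearest whole number."""
    return Counter(round(gc_percent(s)) for s in sequences)

reads = ["ATGC", "GGCC", "ATAT", "ACGT"]  # hypothetical reads
for pct, count in sorted(gc_histogram(reads).items()):
    print(f"{pct}% GC: {count} read(s)")
```

A histogram with a second peak far from the expected GC content of the organism is the kind of pattern that can point to contamination.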

Basic Statistics and Per Sequence Quality Scores

When conducting a thorough quality assessment of sequencing data, it is essential to interpret the FastQC plots to gain insights into the data’s reliability and accuracy. The basic statistics module provides key information about the sample, including the total number of reads, read length, and %GC content. These statistics help us understand if the sequencing data meets our expectations and requirements.

The per base sequence quality plot in FastQC is a valuable tool for assessing the overall sequencing data quality. By examining the distribution of quality scores at each position in the read, we can identify any areas of concern or potential issues that may affect the accuracy of the data. Understanding the variations in quality scores throughout the read helps us assess the reliability of the base calls and make informed decisions during data analysis.

Another important plot in FastQC is the per sequence quality scores plot, which shows how the average quality score per read is distributed across all reads. By examining this plot, we can assess the overall quality of the sequencing data and identify any biases or issues that may impact the reliability of the results. This information enables us to confidently proceed with data analysis and make informed decisions.

Example FastQC Plots

Let’s take a closer look at some example plots from FastQC to illustrate the insights they provide for quality assessment:

| Plot | Description |
|---|---|
| Per Base Sequence Quality | This plot shows the distribution of quality scores at each position in the read. It helps identify regions with lower quality scores, which may indicate potential issues with base calling accuracy. |
| Per Sequence Quality Scores | This plot displays the distribution of average quality scores across the reads. It helps assess the overall quality of the sequencing data and identify potential biases. |
| Per Base Sequence Content | This plot shows the distribution of nucleotide content at each position in the read. It helps identify regions with biased or unexpected nucleotide compositions. |
| Per Sequence GC Content | This plot displays the distribution of the GC content across all reads. It helps assess the uniformity of the GC content and identify potential biases. |

Interpreting these FastQC plots enables us to thoroughly evaluate the quality of sequencing data, identify any potential issues or biases, and make informed decisions during data analysis. By understanding and addressing quality concerns, we can ensure the reliability and accuracy of our results.

Identifying Overrepresented Sequences and Duplicates

During the quality assessment of sequencing data, one crucial aspect is the identification of overrepresented sequences and duplicates. These findings provide valuable insights into potential contamination and library complexity issues, which can impact the accuracy and reliability of the data. By addressing these issues, researchers ensure the quality of the sequencing data and minimize the potential bias or artifacts in their analysis.

To identify overrepresented sequences, FastQC generates a table that lists each sequence accounting for more than 0.1% of the total number of sequences. This information is particularly useful in detecting potential contamination or highly expressed genes. It enables researchers to investigate the cause of these overrepresented sequences and take appropriate measures to address any contamination or bias that may have occurred.

Duplicate sequences, on the other hand, can arise from a variety of factors, such as low-complexity libraries or excessive PCR amplification. While some level of duplication is expected in certain experiments, it is essential to consider the impact of duplicate reads on data analysis. Researchers can use tools like FastQC to identify duplicated sequences and assess their prevalence in the data. Understanding and managing duplicated sequences helps researchers ensure the accuracy and reliability of the sequencing data throughout the analysis process.

Table: Example Overrepresented Sequences and Duplicates

| Sequence | Occurrences | Type |
|---|---|---|
| ATCGGATCGATCGATCGA | 1000 | Contamination |
| TTTTTTTTTTTTTTTTT | 500 | Duplicate |
| CGCGCGCGCGCGCGCGC | 200 | Contamination |

The table above provides an example of overrepresented sequences and duplicates that could be identified during quality assessment. These sequences are categorized based on their occurrences and type, whether it is contamination or duplication. Researchers can use this information to further investigate these sequences and take appropriate actions to minimize their impact on the analysis.
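
A table like this can be approximated by counting identical sequences. The sketch below flags sequences above a fraction of the total and computes a simple duplication rate. The default cutoff of 0.1% is an assumption, and the duplication measure is a simplification: FastQC itself estimates these values from a subset of the reads.

```python
from collections import Counter

def overrepresented(sequences, min_fraction: float = 0.001):
    """List (sequence, count, fraction) for sequences above min_fraction of the total."""
    total = len(sequences)
    if total == 0:
        return []
    counts = Counter(sequences)
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total > min_fraction
    ]

def duplication_rate(sequences) -> float:
    """Fraction of reads that are repeats of a sequence already seen."""
    if not sequences:
        return 0.0
    return 1 - len(set(sequences)) / len(sequences)

reads = ["AAAA"] * 3 + ["ACGT", "TTGA"]  # hypothetical reads
print(overrepresented(reads, min_fraction=0.5))
print(f"duplication rate: {duplication_rate(reads):.2f}")
```

Whether a flagged sequence is contamination, an adapter, or a genuinely abundant transcript cannot be decided from counts alone; it usually requires comparing the sequence against known references.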

Quality Assessment and Data Analysis

Once we have conducted a thorough quality assessment using tools like FastQC, we can confidently proceed with data analysis. These quality assessment tools provide valuable insights into the overall quality of the sequencing data, allowing us to make informed decisions during the analysis process.

During the data analysis phase, one crucial step is aligning the raw reads to the reference genome or transcriptome. Mapping tools such as salmon and STAR can tolerate common quality issues that may arise during this process, including adapter contamination, vector contamination, and low-quality bases at the ends of reads. STAR, for example, can soft-clip read ends that do not match the reference. Because this tolerance generally comes from how reads are matched, not from weighting each base by its quality score, explicit trimming or filtering can still be worthwhile when quality is poor.

By addressing quality issues and optimizing the alignment process, we can obtain reliable and accurate results from our sequencing data analysis. It is important to account for quality scores and perform appropriate filtering or trimming as needed to ensure the integrity and reliability of the results. This ensures that any potential biases or errors introduced by low-quality bases or contamination are minimized, allowing us to extract meaningful insights from our data.

Overall, quality assessment and data analysis both play a crucial role in obtaining reliable and meaningful results from sequencing experiments. By aligning the raw reads to the genome or transcriptome and accounting for quality issues, we can confidently draw conclusions and make significant contributions to the field of genomics and transcriptomics.
