Data processing

12/20/2023

The GDC uses submitted FASTQ or BAM formatted sequence and microarray data to generate derived analysis data. This includes analyses such as tumor sequence variant calls, RNA-Seq gene expression quantification values, and copy-number segmentation values. Sequence data is aligned (or realigned) to the latest human genome reference. The resulting alignments are then processed to produce derived data. The alignment and derived data are available to users via the GDC Data Portal.

Each phase of processing is standardized into common pipelines that use open-source sequence analysis tools. Array data is processed using data-type-specific methods. All sequence data submitted to the GDC is subjected to analysis through these standard pipelines. When data is successfully processed, it is made available through the GDC Data Portal and other access tools. If the data processing reveals underlying issues in the data, the associated files will be recalled and will not be available through the GDC.

The genomic data processing pipelines were developed in consultation with senior experts in the field of cancer genomics and are regularly evaluated and updated as current tools and parameter sets are improved and developed. Reference genome alignment is the first step of data processing for all sequencing-based workflows. While different alignment algorithms are used for each case depending on read length and type, all alignments are performed on the same version of the GRCh38 reference genome. See the GDC Documentation site for details on the algorithm used for each pipeline.

Viral and decoy sequences are included, which draw reads that would not normally map to the human genome, provide information on the presence of oncoviruses, and allow for a more accurate alignment. The current virus decoy set contains 10 types of human viruses, including human cytomegalovirus (CMV), Epstein-Barr virus (EBV), hepatitis B (HBV), hepatitis C (HCV), human immunodeficiency virus (HIV), human herpes virus 8 (HHV-8), human T-lymphotropic virus 1 (HTLV-1), Merkel cell polyomavirus (MCV), simian vacuolating virus 40 (SV40) and human papillomavirus (HPV).

An initial alignment is performed separately on each read group, which is defined as a set of reads that originates from one Illumina sequencing lane. The resulting alignments that originate from a single aliquot are then merged. Pipeline-specific details about the alignment and downstream analyses can be found in their respective sections or documentation sites.

Brief summaries of the workflow used by the GDC are listed below. Each summary has a link to its corresponding section of the GDC Documentation website. The GDC Documentation website contains details about each step of the pipeline, the command-line parameters used to run each step, and information about the corresponding files available at the GDC Data Portal.
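As a rough illustration of the alignment strategy described in this post (align each read group separately, then merge the alignments that come from one aliquot), here is a minimal Python sketch. It is not GDC code: the `Read` record, the `align_read_group` placeholder, and the `align_and_merge` function are hypothetical names invented for this example, and the "alignment" is only a labeled string standing in for the output of a real aligner.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Hypothetical record for one sequencing read (illustration only)."""
    name: str
    aliquot: str     # identifier of the aliquot the read came from
    read_group: str  # one read group = reads from one sequencing lane


def align_read_group(reads):
    """Placeholder for a real aligner run on a single read group.

    Returns one fake 'alignment' string per read so the data flow is visible.
    """
    return [f"{r.name}@{r.read_group}" for r in reads]


def align_and_merge(reads):
    """Align each read group on its own, then merge results per aliquot."""
    # Step 1: bucket reads by (aliquot, read group).
    by_group = defaultdict(list)
    for r in reads:
        by_group[(r.aliquot, r.read_group)].append(r)

    # Step 2: align every read group separately.
    # Step 3: merge the alignments that share an aliquot.
    merged = defaultdict(list)
    for aliquot, read_group in sorted(by_group):
        merged[aliquot].extend(align_read_group(by_group[(aliquot, read_group)]))
    return dict(merged)


if __name__ == "__main__":
    demo = [
        Read("r1", "A1", "lane1"),
        Read("r2", "A1", "lane2"),
        Read("r3", "A2", "lane1"),
    ]
    for aliquot, alignments in align_and_merge(demo).items():
        print(aliquot, alignments)
```

Keeping the per-read-group step separate from the merge step mirrors the description above: each lane's reads are handled independently, and only the finished alignments are combined at the aliquot level.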
