2025-03-17

Bioconductor

Bioconductor is an open-source software project.
It is community-driven and provides a framework of tools and databases for the analysis of biological data in R.

In a nutshell:

  • Supports researchers in performing rigorous and reproducible analyses
  • Primarily based on the R programming language, but also includes contributions in other languages
  • Founded in 2001, promoted by the Fred Hutchinson Cancer Research Center

Its main goals are:

  • Providing access to statistical and graphical methods for genomic data analysis

  • Facilitating the integration of metadata into analyses, e.g., literature data from PubMed

  • Offering a platform for developing new curated packages

  • Training researchers in computational and statistical methods for biological data analysis

  • All packages undergo review before release

  • Follows an open-source and open-development model



Bioconductor webpage



Bioconductor structure

Bioconductor packages fall into four main categories:

  • Software: Tools for data analysis and visualization

  • Annotation: Resources for integrating biological metadata

  • Experimental data: Datasets for testing and benchmarking methods

  • Workflows: Predefined pipelines for common analysis tasks


Each package includes a vignette that explains its functionalities and provides usage examples.



Installing Bioconductor and its packages

To install Bioconductor, simply run in R console:

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(version = "3.18")


To install Bioconductor packages, use the following command:

BiocManager::install("package_name")

Nevertheless, each Bioconductor package includes a description of the installation command in R, which you can simply copy and paste.


You can also explore available packages by typing:

BiocManager::available( )



Genomic data types

Bioconductor enables the manipulation of various biological data types.

The most common data types include:

  • Genomic sequences (FASTA)

  • Sequencing files (FASTQ)

  • Genomic intervals (BED files)

  • Genomic features (GFF/GTF)

  • Alignment data (SAM and BAM)


In this course, we will focus on handling genomic sequences and intervals, as well as retrieving genomic features.



Genomic sequences

The FASTA format is a text-based format used to represent nucleotide sequences (DNA/RNA) or amino acid sequences (proteins). In this format, nucleotides or amino acids are represented using single-letter codes, and each sequence entry can be preceded by a description or comment.


FASTA files consist of alternating lines of two types:

  • Header line: description of the sequence, preceded by “>” (e.g., >Sequence_ID Description)

  • Sequence data: the actual nucleotide or amino acid sequence


Full genome sequences for many model organisms are available in the BSgenome package, providing easy access to reference genomes for analysis.

Genomic sequences can be processed in Bioconductor using functions from the Biostrings package.



BSgenome package

To retrieve the genome sequence of a specific species, we use the BSgenome package.

BSgenome is a Biostrings-based genome data package because the sequences it contains are stored in fundamental data structures defined within the Biostrings package.
This ensures efficient handling and manipulation of large genomic datasets.

To check which genome packages are available for different species, use:

library(BSgenome)
head(available.genomes( ))
## [1] "BSgenome.Alyrata.JGI.v1"                
## [2] "BSgenome.Amellifera.BeeBase.assembly4"  
## [3] "BSgenome.Amellifera.NCBI.AmelHAv3.1"    
## [4] "BSgenome.Amellifera.UCSC.apiMel2"       
## [5] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [6] "BSgenome.Aofficinalis.NCBI.V1"


The BSgenome package naming format follows this structure:

BSgenome.<species>.<source>.<major_version>

where:

  • <species> refers to the scientific name of the organism (e.g., Hsapiens for Homo sapiens)

  • <source> indicates the genome assembly provider (e.g., UCSC or NCBI)

  • <major_version> represents the genome version (e.g., hg38 for Homo sapiens GRCh38)


If, for example, we want to retrieve human hg19 genome from UCSC database, we need to type the following command in R:

BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19


Then, by typing the name of the BSgenome object in R, we can retrieve detailed information about the genome:

genome
## | BSgenome object for Human
## | - organism: Homo sapiens
## | - provider: UCSC
## | - genome: hg19
## | - release date: June 2013
## | - 298 sequence(s):
## |     chr1                  chr2                  chr3                 
## |     chr4                  chr5                  chr6                 
## |     chr7                  chr8                  chr9                 
## |     chr10                 chr11                 chr12                
## |     chr13                 chr14                 chr15                
## |     ...                   ...                   ...                  
## |     chr19_gl949749_alt    chr19_gl949750_alt    chr19_gl949751_alt   
## |     chr19_gl949752_alt    chr19_gl949753_alt    chr20_gl383577_alt   
## |     chr21_gl383578_alt    chr21_gl383579_alt    chr21_gl383580_alt   
## |     chr21_gl383581_alt    chr22_gl383582_alt    chr22_gl383583_alt   
## |     chr22_kb663609_alt                                               
## | 
## | Tips: call 'seqnames()' on the object to get all the sequence names, call
## | 'seqinfo()' to get the full sequence info, use the '$' or '[[' operator to
## | access a given sequence, see '?BSgenome' for more information.


This summary provides an overview of the genome and guidance on how to access its sequences.

We can use the following functions to explore the genome structure and retrieve specific sequences:

  • seqnames() – Retrieves all chromosome names
head(seqnames(genome))
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"


  • seqlengths() – Extracts chromosome lengths
head(seqlengths(genome))
##      chr1      chr2      chr3      chr4      chr5      chr6 
## 249250621 243199373 198022430 191154276 180915260 171115067


  • $ and [[ ]] accessors – Retrieve a specific chromosome sequence.
    The returned sequence is a DNAString object from the Biostrings package, which is optimized for storing and manipulating large sequences efficiently.
chr1_seq <- genome$chr1 
chr1_seq
## 249250621-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
chr1_seq_alt <- genome[["chr1"]]  
chr1_seq_alt
## 249250621-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN


class(chr1_seq)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"



DNAString object

To subset a DNAString object, we have two main approaches:

A. Using [] (Bracket Indexing)

chr1 = genome$chr1
chr1[1000000:1000010]
## 11-letter DNAString object
## seq: TGGGCACAGCC


B. Using subseq() function: more explicit way to extract a subsequence

sequence <- subseq(chr1, start = 1000000, end = 1000010)
sequence
## 11-letter DNAString object
## seq: TGGGCACAGCC



Biostrings package

The Biostrings package provides a variety of functions for managing and analyzing genomic sequences efficiently.

Examples of useful functions in Biostrings:

  • Evaluating base frequency using alphabetFrequency()
    Output: A table showing counts of A, C, G, T, and other characters.
alphabetFrequency(sequence)
## A C G T M R W S Y K V H D B N - + . 
## 2 4 4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0


  • Generating complement, reverse or reverse complement of a sequence with complement( ), reverse() and reverseComplement( )
complement(sequence)
## 11-letter DNAString object
## seq: ACCCGTGTCGG
reverse(sequence)
## 11-letter DNAString object
## seq: CCGACACGGGT
reverseComplement(sequence)
## 11-letter DNAString object
## seq: GGCTGTGCCCA


  • Translating a DNA sequence into amino acid sequence using translate( ).
    If the sequence length is not a multiple of three, the remaining bases at the end are ignored.
translate(sequence)
## 3-letter AAString object
## seq: WAQ


  • Assessing pattern matching in DNA sequences:
    • Find exact match of a specific motif within a sequence using matchPattern( )
    • Count occurrence of a pattern by countPattern( )
matchPattern("CA", sequence)
## Views on a 11-letter DNAString subject
## subject: TGGGCACAGCC
## views:
##       start end width
##   [1]     5   6     2 [CA]
##   [2]     7   8     2 [CA]
countPattern("CA", sequence)
## [1] 2


By adjusting max.mismatch and min.mismatch, we can tolerate sequencing errors or mutations.
The default setting (max.mismatch = 0, min.mismatch = 0) enforces perfect matching.

matchPattern("CAC", sequence, max.mismatch=1)
## Views on a 11-letter DNAString subject
## subject: TGGGCACAGCC
## views:
##       start end width
##   [1]     5   7     3 [CAC]
##   [2]     7   9     3 [CAG]
countPattern("CAC", sequence, max.mismatch=1)
## [1] 2



Creating and reading a FASTA

To create a FASTA file from a sequence in Biostrings, follow these steps:

  1. Convert a DNAString to a DNAStringSet
    A DNAStringSet object is required to store multiple sequences in a FASTA format.


  1. Set sequence names
    Each FASTA sequence needs a header (identifier). We assign it using names()


  1. Write the sequence to a FASTA file
    Use writeXStringSet() to save the sequence to a FASTA file.
sequence = DNAStringSet(sequence)
names(sequence) <- "My sequence"
sequence
## DNAStringSet object of length 1:
##     width seq                                               names               
## [1]    11 TGGGCACAGCC                                       My sequence
writeStringSet(sequence, "My_sequence.fa")


To read a FASTA file back into R, use readDNAStringSet().





Genomic intervals (BED files)

The BED (Browser Extensible Data) file format is a widely used tab-delimited format for storing genomic regions.

It is commonly used for annotations and analysis results such as transcription start sites (TSS), ChIP-seq peaks, and other genomic features.

For a more comprehensive description, visit https://genome.ucsc.edu/FAQ/FAQformat.html##format1


Required fields:

  • Chromosome: name of the chromosome (e.g., "chr1")

  • Start Position: 0-based start coordinate (first base is 0)

  • End Position: 1-based end coordinate (last base is not included)


Optional (but often useful) fields:

  • Name: Identifier or description of the feature

  • Score: A number between 0 and 1000 (e.g., ChIP-seq peak intensity)

  • Strand: "+" (forward), "-" (reverse), or "." (unknown)


No header line is present in BED files.

BED coordinates are zero-based (start position) and half-open (end position is excluded).

This format is compact and efficient, making it ideal for large genomic datasets.


Example of a BED file

chr1 1000 1050 GeneX 900 + 
chr1 2000 2100 ChIP_peak1 750 - 
chr2 3000 3200 Enhancer 500 .



Manipulating genomic intervals

When working with genomic regions, two key Bioconductor packages are:

  1. GenomicRanges: For creating, manipulating, and analyzing genomic intervals.

  2. rtracklayer: For importing and exporting genomic data (e.g., BED, GFF, BigWig).



Creating a GenomicRanges (GRanges) object in R

To create a GRanges Object, we first need to define genomic intervals using the IRanges package.

  1. Load required packages
library(GenomicRanges) 
library(IRanges)


  1. Create an IRanges Object
    The IRanges package defines intervals (start and end positions) without chromosome or strand information.
intervals = IRanges(start = c(2000, 900, 10000)
                    , end = c(2040, 950, 200000)
)
intervals
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]      2000      2040        41
##   [2]       900       950        51
##   [3]     10000    200000    190001


  1. Convert to GRanges Object
    To create a GRanges Object, we must associate IRanges with chromosomes and strands.
GR = GRanges(seqnames = c("chr1","chr3","chr2")
             , ranges = intervals
             , strand = c("+", "-", "*")
)
GR
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames       ranges strand
##          <Rle>    <IRanges>  <Rle>
##   [1]     chr1    2000-2040      +
##   [2]     chr3      900-950      -
##   [3]     chr2 10000-200000      *
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths


  1. Add metadata (optional)

We can attach gene names, scores, or other annotations to each region.

mcols(GR) <- DataFrame(gene = c("SF3B1", "SRSF7", "TPX2"),
                       score = c(900, 750, 500))
GR
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames       ranges strand |        gene     score
##          <Rle>    <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr1    2000-2040      + |       SF3B1       900
##   [2]     chr3      900-950      - |       SRSF7       750
##   [3]     chr2 10000-200000      * |        TPX2       500
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths



Retrieving information and modifying values in GenomicRanges

A GRanges Object can be manipulated like a vector, allowing us to extract, subset, and modify genomic intervals.

  • Extracting specific rows
    To retrieve the first two rows from a GRanges Object, use indexing:
GR[1:2]
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames    ranges strand |        gene     score
##          <Rle> <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr1 2000-2040      + |       SF3B1       900
##   [2]     chr3   900-950      - |       SRSF7       750
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths


  • Extracting specific informations
    We can retrieve individual components of a GRanges Object using functions: start( ), end( ), seqnames( ), strand(), width(), mcols()
seqnames(GR)  # Chromosome names
## factor-Rle of length 3 with 3 runs
##   Lengths:    1    1    1
##   Values : chr1 chr3 chr2
## Levels(3): chr1 chr3 chr2
start(GR)     # Start positions
## [1]  2000   900 10000
end(GR)       # End positions
## [1]   2040    950 200000
strand(GR)    # Strand information
## factor-Rle of length 3 with 3 runs
##   Lengths: 1 1 1
##   Values : + - *
## Levels(3): + - *
width(GR)     # Interval lengths
## [1]     41     51 190001
mcols(GR)     # Metadata (e.g., gene names, scores)
## DataFrame with 3 rows and 2 columns
##          gene     score
##   <character> <numeric>
## 1       SF3B1       900
## 2       SRSF7       750
## 3        TPX2       500


  • Modifying values in a GRanges Object
    We can update values directly, just like a regular vector:
# Change the chromosome of the first entry
seqnames(GR)[1] <- "chr2"

# Modify the start position of the second interval
# Start must be always minor than End position
start(GR)[2] <- 910

# Change strand of the third interval
strand(GR)[3] <- "+"

# Update metadata (e.g., rename a gene)
mcols(GR)$gene[1] <- "MYC"


In GRanges, chromosome names (seqnames(gr)) are stored as factors.
This means you cannot directly assign a chromosome that is not already in the object.
Instead, you must add the new chromosome as a factor level first:

seqlevels(GR) # only chr1, chr2 and chr3 are defined
## [1] "chr1" "chr3" "chr2"
seqlevels(GR) <- c("chr1", "chr2", "chr3", "chr8", "chr9", "chr21") # Adding levels
seqnames(GR)[1:3] <- c("chr8", "chr9", "chr21")



Adding strand and names to a GRanges Object

When creating a GRanges object, we can assign strand information: +, -, *

GR2 = GRanges(seqnames = c("chr1","chr3","chr2")
              , ranges = intervals
              , strand = c("*", "+", "-")
)
GR2
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames       ranges strand
##          <Rle>    <IRanges>  <Rle>
##   [1]     chr1    2000-2040      *
##   [2]     chr3      900-950      +
##   [3]     chr2 10000-200000      -
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths


We can also modify strand values after creating the object:

strand(GR) = c("*", "+", "-")
GR
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames       ranges strand |        gene     score
##          <Rle>    <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr8    2000-2040      * |         MYC       900
##   [2]     chr9      910-950      + |       SRSF7       750
##   [3]    chr21 10000-200000      - |        TPX2       500
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths


In GRanges, names are stored in metadata columns (e.g., mcols(gr)$name).

names(GR) = c("SFRB1", "FOXA1", "MYC")
GR
## GRanges object with 3 ranges and 2 metadata columns:
##         seqnames       ranges strand |        gene     score
##            <Rle>    <IRanges>  <Rle> | <character> <numeric>
##   SFRB1     chr8    2000-2040      * |         MYC       900
##   FOXA1     chr9      910-950      + |       SRSF7       750
##     MYC    chr21 10000-200000      - |        TPX2       500
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths

As with vectors, we can use names to index and retrieve genomic intervals of interest.

GR["SFRB1"]
## GRanges object with 1 range and 2 metadata columns:
##         seqnames    ranges strand |        gene     score
##            <Rle> <IRanges>  <Rle> | <character> <numeric>
##   SFRB1     chr8 2000-2040      * |         MYC       900
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths



Metadata columns

In GRanges, we can store multiple metadata columns (e.g., gene names, expression levels, scores) by specifying them directly in the GRanges() function.

GR3 = GRanges(seqnames = c("chr1","chr3","chr2")
              , ranges = intervals
              , strand = c("*", "+", "-")
              , score = c(10, 40, 60)
              , expression = c(2.5, 5.0, 3.2)
)
GR3
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames       ranges strand |     score expression
##          <Rle>    <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1    2000-2040      * |        10        2.5
##   [2]     chr3      900-950      + |        40        5.0
##   [3]     chr2 10000-200000      - |        60        3.2
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths


Using mcols( ) function we can view all metadata columns, while using $ accessor we can get values from a specific column

mcols(GR3)
## DataFrame with 3 rows and 2 columns
##       score expression
##   <numeric>  <numeric>
## 1        10        2.5
## 2        40        5.0
## 3        60        3.2
GR3$expression
## [1] 2.5 5.0 3.2



Converting between GRanges and data.frame

In Bioconductor, we often need to convert GRanges objects to data.frame (for easy manipulation) and vice versa (to restore genomic features).

  • Convert a GRanges Object to a data.frame : use the as.data.frame() function.
    Now, the genomic intervals are stored as a standard data frame, which makes them easier to manipulate.
GR3 = as.data.frame(GR3)
GR3
##   seqnames start    end  width strand score expression
## 1     chr1  2000   2040     41      *    10        2.5
## 2     chr3   900    950     51      +    40        5.0
## 3     chr2 10000 200000 190001      -    60        3.2


  • Convert a data.frame back to GRanges
    If we want to restore the genomic intervals, we must convert the data frame back into a GRanges object.
    The makeGRangesFromDataFrame() function automatically converts a data.frame into a GRanges object, making sure that genomic coordinates and metadata are correctly mapped:
GR3 = makeGRangesFromDataFrame(GR3,
                               , keep.extra.columns = TRUE           # Keep metadata columns
                               , ignore.strand = FALSE               # Use strand information
                               , seqinfo = NULL                      # Optional: Provide Seqinfo object for chromosome lengths
                               , seqnames.field = c("seqnames")      # Column storing chromosome names
                               , start.field = "start"               # Names for the column in df that contains the start positions of the genomic ranges
                               , end.field = c("end")                # Names for the column in df that contains the end positions of the genomic ranges. 
                               , strand.field = "strand"             # Names for the column in df that contains the start positions of the genomic ranges. 
                               , starts.in.df.are.0based = FALSE     # Set to TRUE if 0-based coordinates 
)

GR3
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames       ranges strand |     score expression
##          <Rle>    <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1    2000-2040      * |        10        2.5
##   [2]     chr3      900-950      + |        40        5.0
##   [3]     chr2 10000-200000      - |        60        3.2
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths



Feature GRanges data.frame
Stores Genomic Ranges Yes (IRanges for intervals) No (uses numeric start/end)
Metadata Support Yes (mcols(gr)) Yes (columns)
Strand Info Yes (strand(gr)) No (stored as column)
Easier to Analyze in R More specialized Standard data manipulation





2025-03-18

Shifting genomic ranges in GRanges

The shift() function allows us to move genomic intervals by a specified number of bases without changing their length.
This can be useful for tasks like simulating mutations or adjusting genomic annotations.

GR3[1]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score expression
##          <Rle> <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1 2000-2040      * |        10        2.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths
GenomicRanges::shift(GR3[1], shift = 40)
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score expression
##          <Rle> <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1 2040-2080      * |        10        2.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths


We can apply shift() to all genomic intervals at once:

GenomicRanges::shift(GR3, shift = 50)
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames       ranges strand |     score expression
##          <Rle>    <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1    2050-2090      * |        10        2.5
##   [2]     chr3     950-1000      + |        40        5.0
##   [3]     chr2 10050-200050      - |        60        3.2
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths


If needed, we can adjust shift direction based on strand:

shift(GR3[strand(GR3) == "+"], shift = 50)
shift(GR3[strand(GR3) == "-"], shift = -50)



Resizing genomic ranges

The resize() function allows us to change the width (length) of genomic intervals while specifying whether to fix the start, end, or center of the interval.

resize() is strand-aware, meaning:

  • On the “+” strand: start is the beginning of the interval
  • On the “-” strand: start refers to the rightmost coordinate (end).
GR3[1]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score expression
##          <Rle> <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1 2000-2040      * |        10        2.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths
resize(GR3[1], width = 100, fix = "start")
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score expression
##          <Rle> <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1 2000-2099      * |        10        2.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths
resize(GR3[1], width = 100, fix = "end")
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score expression
##          <Rle> <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1 1941-2040      * |        10        2.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths
resize(GR3[1], width = 100, fix = "center")
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score expression
##          <Rle> <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1 1970-2069      * |        10        2.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths
GR3[3]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames       ranges strand |     score expression
##          <Rle>    <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr2 10000-200000      - |        60        3.2
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths
resize(GR3[3], width = 100,fix = "start")
## GRanges object with 1 range and 2 metadata columns:
##       seqnames        ranges strand |     score expression
##          <Rle>     <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr2 199901-200000      - |        60        3.2
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths


fix Option Positive Strand (+) Negative Strand (-) Unstranded (*)
“start” Keeps start fixed, expands to the right Keeps start fixed (which is actually the “end”), expands to the left Keeps start fixed
“end” Keeps end fixed, expands to the left Keeps end fixed (which is actually the “start”), expands to the right Keeps end fixed
“center” Expands symmetrically around the midpoint Expands symmetrically around the midpoint Expands symmetrically



Merging overlapping genomic ranges

The reduce() function in GenomicRanges is used to merge overlapping genomic intervals while preserving chromosome and strand information.

If ignore.strand = TRUE, merges intervals across strands.

GR4 = GRanges(seqnames = c("chr1", "chr1", "chr1")
              , strand = c("+", "-", "-")
              , ranges = IRanges(start = c(1,8,20), end = c(24, 50, 100))
              )

# Merge overlapping regions (strand-aware)
GenomicRanges::reduce(GR4)
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1      1-24      +
##   [2]     chr1     8-100      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Merge ignoring strand
GenomicRanges::reduce(GR4, ignore.strand = TRUE)
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1     1-100      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths



Finding overlapping genomic ranges

The findOverlaps() function in GenomicRanges helps identify which genomic intervals overlap between two GRanges Objects.
It returns a Hits object, which stores indices of overlapping elements from both datasets.

Specifically, we define:

  • Query: first GRanges Object
  • Subject: second GRanges Object



By applying findOverlaps() we get a Hits object that stores:

  • Query indices: Indices of overlapping elements in Query

  • Subject indices: Indices of overlapping elements in Subject



We can extract those indices using:

  • queryHits(overlap): Indices of overlapping elements in Query

  • subjectHits(overlap): Indices of overlapping elements in Subject

GR4 = GRanges(seqnames = c("chr1", "chr1", "chr1")
              , strand = c("+", "-", "-")
              , ranges = IRanges(start = c(1,8,20), end = c(24, 50, 100))
              )
GR5 = GRanges(seqnames = c("chr1", "chr1", "chr1")
              , strand = c("+", "+", "+")
              , ranges = IRanges(start = c(5,10,100), end = c(60, 20, 150))
              )

overlap = findOverlaps(GR4, GR5)
queryHits(overlap)
## [1] 1 1
subjectHits(overlap)
## [1] 1 2


These functions give us the indices of overlapping intervals, so we can use them to retrieve the overlapping ranges:

GR4[queryHits(overlap)]
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1      1-24      +
##   [2]     chr1      1-24      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
GR5[subjectHits(overlap)]
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1      5-60      +
##   [2]     chr1     10-20      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths


If we want to ignore strand when finding overlaps, we use ignore.strand = TRUE:

findOverlaps(GR4, GR5, ignore.strand = TRUE)

This considers + and - strands as the same and finds more matches.



Importing and exporting in BED format

Genomic intervals are often stored in BED files.
The rtracklayer package provides functions to import BED files as GRanges objects and export GRanges objects as BED files:

  • import.bed("file.bed") : Reads a BED file and converts it into a GRanges object
  • export.bed(gr_object, "file.bed"): Saves a GRanges object as a BED file



Combining GenomicRanges and BSgenome to extract sequences

We have learned how to extract specific genomic sequences using:

  • subseq() or []: Extracting sub-sequences from a DNAString object

  • BSgenome + GenomicRanges: Extracting sequences across the entire genome

With GenomicRanges, we can retrieve sequences for multiple genomic intervals at once from a BSgenome reference genome.


For example, if we want to extract sequences from hg19 human genome:

library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
library(Biostrings)

# Define genomic intervals in a GRanges object
GR6 = GRanges(seqnames = c("chr1", "chr1", "chr1")
              , strand = c("+", "-", "-")
              , ranges = IRanges(start = c(100000,800000,2000000), end = c(100020,800030,2000040))
              )

# Extract sequences from the reference genome
sequences = getSeq(BSgenome.Hsapiens.UCSC.hg19, GR6)
sequences
## DNAStringSet object of length 3:
##     width seq
## [1]    21 CACTAAGCACACAGAGAATAA
## [2]    31 TTAAAAGCTTAAAGCAAACAGAAAGACATGC
## [3]    41 GCGTGACCCTCCTTTGGAAAAGGGTTCCTGCACGTGTGATT


The result is a DNAStringSet object containing sequences for each genomic interval.

We can export the extracted sequences in FASTA format using Biostrings::writeXStringSet().

# Assign sequence names
names(sequences) <- c("region1", "region2", "region3")

# Save sequences to a FASTA file
writeXStringSet(sequences, "extracted_sequences.fasta")





A brief recap

The central dogma of biology

The cell is a highly complex system, where essential biological processes occur, including:

  • regulation of gene expression
  • transport of molecules
  • synthesis of biomolecules
  • conversion of chemical energy

Gene expression involves a sequence of molecular transformations: DNA → RNA → Protein

In 1956, Francis Crick described this directional flow of genetic information as the “central dogma” of molecular biology.


This concept forms the foundation of molecular genetics, explaining how genetic instructions in DNA are transcribed into RNA and translated into proteins.

Alberts et al. 2002, the pathway from DNA to protein
Alberts et al. 2002, the pathway from DNA to protein


The structure of a transcript

Once a precursor messenger RNA (pre-mRNA) is synthesized, it undergoes several modifications before being transported out of the nucleus for translation.

A key step in mRNA maturation is splicing, a process where introns (non-coding sequences) are removed whereas exons (coding sequences) are joined together.
The final product is a mature messenger RNA (mRNA), which can be translated into a functional protein.


More than 95% of human genes are transcribed into multiple isoforms (mature transcripts) that encode for different proteins, thus expanding protein diversity. This mechanism is called alternative splicing.

Chimica online, alternative splicing
Chimica online, alternative splicing


Key takeaways:

  • Exons are the building blocks of transcripts: They contain the coding sequences that determine protein structure.

  • 95% of genes produce multiple transcripts: Through alternative splicing, a single gene can generate different mRNA isoforms, leading to greater protein diversity.



Genes as genomic features

Genomic features, such as genes, transcripts, and exons, are stored in GTF (Gene Transfer Format) or GFF (General Feature Format) files.
These files allow us to reconstruct gene models, which represent the structure and organization of genes within the genome.

Gene models can be accessed using the TxDb (Transcript Database) packages in Bioconductor.
The standard format for TxDb packages follows this structure: TxDb.<species>.<source>.<major version>.<table>

BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")


If you have a GTF file stored locally, you can import it and convert it into a TxDb object using the makeTxDbFromGFF() function from the GenomicFeatures package:

library(GenomicFeatures)
txdb <- makeTxDbFromGFF(file = "GENCODE_v44.gtf")

This allows you to analyze custom gene annotations or work with specific genome versions.



TxDb Object: Exploring Gene Models

A TxDb object stores structured genomic annotations, including gene, transcript, exon, and CDS (coding sequence) information.

Typing the name of a TxDb object provides an overview of the gene model it contains:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TxDb.Hsapiens.UCSC.hg19.knownGene
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1


This displays key metadata such as:

  • The genome version (e.g., hg19)

  • The data source (e.g., UCSC, Ensembl)

  • The number of chromosomes, genes, transcripts, and exons



Interrogating a TxDb object

A TxDb object allows us to extract various genomic features such as genes, transcripts, exons, coding sequences (CDS), and promoters.


The output of these queries is typically a GRanges Object, which makes it easy to manipulate genomic intervals.

Feature Function Call
Genes genes(txdb)
Transcripts transcripts(txdb)
Exons exons(txdb)
CDS (Coding Sequences) cds(txdb)
Promoters promoters(txdb, upstream, downstream)


We can retrieve transcript IDs and corresponding gene IDs by specifying additional columns in the query:

transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene, columns = c("gene_id","tx_id"))
## GRanges object with 82960 ranges and 2 metadata columns:
##                 seqnames        ranges strand |         gene_id     tx_id
##                    <Rle>     <IRanges>  <Rle> | <CharacterList> <integer>
##       [1]           chr1   11874-14409      + |       100287102         1
##       [2]           chr1   11874-14409      + |       100287102         2
##       [3]           chr1   11874-14409      + |       100287102         3
##       [4]           chr1   69091-70008      + |           79501         4
##       [5]           chr1 321084-321115      + |                         5
##       ...            ...           ...    ... .             ...       ...
##   [82956] chrUn_gl000237        1-2686      - |                     82956
##   [82957] chrUn_gl000241   20433-36875      - |                     82957
##   [82958] chrUn_gl000243   11501-11530      + |                     82958
##   [82959] chrUn_gl000243   13608-13637      + |                     82959
##   [82960] chrUn_gl000247     5787-5816      - |                     82960
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome


The promoters() function identifies regions surrounding the transcription start site (TSS). It requires specifying: - Upstream (e.g., 1000 bases) - Downstream (e.g., 100 bases)

promoters(TxDb.Hsapiens.UCSC.hg19.knownGene
          , upstream = 1000
          , downstream = 100)
## GRanges object with 82960 ranges and 2 metadata columns:
##                    seqnames        ranges strand |     tx_id     tx_name
##                       <Rle>     <IRanges>  <Rle> | <integer> <character>
##   uc001aaa.3           chr1   10874-11973      + |         1  uc001aaa.3
##   uc010nxq.1           chr1   10874-11973      + |         2  uc010nxq.1
##   uc010nxr.1           chr1   10874-11973      + |         3  uc010nxr.1
##   uc001aal.1           chr1   68091-69190      + |         4  uc001aal.1
##   uc001aaq.2           chr1 320084-321183      + |         5  uc001aaq.2
##          ...            ...           ...    ... .       ...         ...
##   uc011mgu.1 chrUn_gl000237     2587-3686      - |     82956  uc011mgu.1
##   uc011mgv.2 chrUn_gl000241   36776-37875      - |     82957  uc011mgv.2
##   uc011mgw.1 chrUn_gl000243   10501-11600      + |     82958  uc011mgw.1
##   uc022brq.1 chrUn_gl000243   12608-13707      + |     82959  uc022brq.1
##   uc022brr.1 chrUn_gl000247     5717-6816      - |     82960  uc022brr.1
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome


The transcriptLengths() function computes the sum of exon lengths in each transcript (not including introns).

head(transcriptLengths(TxDb.Hsapiens.UCSC.hg19.knownGene))
##   tx_id    tx_name   gene_id nexon tx_len
## 1     1 uc001aaa.3 100287102     3   1652
## 2     2 uc010nxq.1 100287102     3   1488
## 3     3 uc010nxr.1 100287102     3   1595
## 4     4 uc001aal.1     79501     1    918
## 5     5 uc001aaq.2      <NA>     1     32
## 6     6 uc001aar.2      <NA>     1     62

Note: This function measures the processed transcript length (sum of exonic regions), not the full genomic region transcribed into RNA. To get the total transcribed DNA length, use:

head(width(transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)))
## [1] 2536 2536 2536  918   32   62



Extracting information

The select() function allows us to retrieve specific annotations from a TxDb object, such as transcript names, gene IDs, or exon details.
In particular, we can:

  1. Check available data fields
    • Use columns() to list available information (e.g., transcript names, gene lengths)

    • Use keytypes() to see which column types can be used as keys (e.g., GENEID, TXID)


  1. Define your key of interest: create a vector of gene IDs or transcript IDs for querying.


  1. Use select() to extract data



For example:

# Step 1: Define gene IDs of interest  
keys <- c("100033416", "100033417", "100033420")  

# Step 2: Check available columns and key types  
columns(TxDb.Hsapiens.UCSC.hg19.knownGene)  # Lists all available metadata fields  
##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"  
##  [6] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"  
## [11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"   
## [16] "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"  
## [21] "TXTYPE"
keytypes(TxDb.Hsapiens.UCSC.hg19.knownGene) # Lists valid key types  
## [1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"
# Step 3: Retrieve transcript names for selected gene IDs  
select(TxDb.Hsapiens.UCSC.hg19.knownGene, keys = keys, columns = "TXNAME", keytype = "GENEID")
##      GENEID     TXNAME
## 1 100033416 uc001yxl.4
## 2 100033417 uc001yxo.3
## 3 100033420 uc001yxr.3



Combining GenomicFeatures and BSgenome for Transcript Sequences

We can extract transcript sequences by combining a TxDb object (which contains gene and transcripts annotations) with a BSgenome object (which stores genome sequences).

The function extractTranscriptSeqs() allows us to extract full transcript sequences using a reference genome.

extractTranscriptSeqs(x = BSgenome.Hsapiens.UCSC.hg19, transcripts = TxDb.Hsapiens.UCSC.hg19.knownGene)
## DNAStringSet object of length 82960:
##          width seq                                          names               
##     [1]   1652 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 1
##     [2]   1488 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 2
##     [3]   1595 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 3
##     [4]    918 ATGGTGACTGAATTCATTTTT...ATTCTAGTGTAAAGTTTTAG 4
##     [5]     32 TACAGACCAAGCTCATGACTCACAATGGCCTA             5
##     ...    ... ...
## [82956]   1217 GCCAGTTTAGGGTCTCTGGTA...AACCTCCGCCTCCTGAGATC 82956
## [82957]    737 CTCCACTTCTGATCCTCCCCG...CCGCTTTATTAGATGCAGTG 82957
## [82958]     30 TGGTGAATTTCAGCCAAAGTGGCCAAAGAA               82958
## [82959]     30 TTGCAGAGGTGGCTGGTTGCTCTTTGAGCC               82959
## [82960]     30 TGGTGAATTTCAGCCAAAGTGGCCAAAGAA               82960


Key applications:

  • Analyze transcript sequences from genome data

  • Compare transcript isoforms

  • Extract sequences for further downstream analysis (e.g., translation, motif search)



Extracting features as a GRangesList

If we need to group genomic features, we can use the following functions, which return a GRangesList:

  • transcriptsBy(): Groups transcripts by gene or other features
  • exonsBy(): Groups exons by transcript or gene
  • cdsBy(): Groups CDS (coding sequences) by transcript or gene


These will return a GRangesList where each element contains the exons of a specific transcript.

exonsBy(x = TxDb.Hsapiens.UCSC.hg19.knownGene, by = "tx")[1:2]
## GRangesList object of length 2:
## $`1`
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames      ranges strand |   exon_id   exon_name exon_rank
##          <Rle>   <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 11874-12227      + |         1        <NA>         1
##   [2]     chr1 12613-12721      + |         3        <NA>         2
##   [3]     chr1 13221-14409      + |         5        <NA>         3
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $`2`
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames      ranges strand |   exon_id   exon_name exon_rank
##          <Rle>   <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 11874-12227      + |         1        <NA>         1
##   [2]     chr1 12595-12721      + |         2        <NA>         2
##   [3]     chr1 13403-14409      + |         6        <NA>         3
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

We can change the by parameter to group differently:

  • By transcript = tx

  • By gene = gene



Export a TxDb Object as a GTF/GFF file

If we need to export a TxDb object (gene annotation data) into a GTF or GFF file, we can use the export.gff() function from the rtracklayer package.

export.gff(TxDb.Hsapiens.UCSC.hg19.knownGene, con = "file.gtf", format ="gtf")
export.gff(TxDb.Hsapiens.UCSC.hg19.knownGene, con = "file.gff", format ="gff")

GTF (Gene Transfer Format) is a widely used annotation format in gene expression analysis.
GFF (General Feature Format) is a more general format used in genomic annotations.