2025-03-17

Bioconductor

Bioconductor is an open-source software project.
It is community-driven and provides a framework of tools and databases for the analysis of biological data in R.

In a nutshell:

Supports researchers in performing rigorous and reproducible analyses
Primarily based on the R programming language, but also includes contributions in other languages
Founded in 2001, promoted by the Fred Hutchinson Cancer Research Center

Its main goals are:

Providing access to statistical and graphical methods for genomic data analysis
Facilitating the integration of metadata into analyses, e.g., literature data from PubMed
Offering a platform for developing new curated packages
Training researchers in computational and statistical methods for biological data analysis
All packages undergo review before release
Follows an open-source and open-development model

Bioconductor webpage

Bioconductor structure

Bioconductor packages fall into four main categories:

Software: Tools for data analysis and visualization
Annotation: Resources for integrating biological metadata
Experimental data: Datasets for testing and benchmarking methods
Workflows: Predefined pipelines for common analysis tasks

Each package includes a vignette that explains its functionalities and provides usage examples.

Installing Bioconductor and its packages

To install Bioconductor, simply run in R console:

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(version = "3.18")

To install Bioconductor packages, use the following command:

BiocManager::install("package_name")

Nevertheless, each Bioconductor package includes a description of the installation command in R, which you can simply copy and paste.

You can also explore available packages by typing:

BiocManager::available( )

Genomic data types

Bioconductor enables the manipulation of various biological data types.

The most common data types include:

Genomic sequences (FASTA)
Sequencing files (FASTQ)
Genomic intervals (BED files)
Genomic features (GFF/GTF)
Alignment data (SAM and BAM)

In this course, we will focus on handling genomic sequences and intervals, as well as retrieving genomic features.

Genomic sequences

The FASTA format is a text-based format used to represent nucleotide sequences (DNA/RNA) or amino acid sequences (proteins). In this format, nucleotides or amino acids are represented using single-letter codes, and each sequence entry can be preceded by a description or comment.

FASTA files consist of alternating lines of two types:

Header line: description of the sequence, preceded by “>” (e.g., >Sequence_ID Description)
Sequence data: the actual nucleotide or amino acid sequence

Full genome sequences for many model organisms are available in the BSgenome package, providing easy access to reference genomes for analysis.

Genomic sequences can be processed in Bioconductor using functions from the Biostrings package.

BSgenome package

To retrieve the genome sequence of a specific species, we use the BSgenome package.

BSgenome is a Biostrings-based genome data package because the sequences it contains are stored in fundamental data structures defined within the Biostrings package.
This ensures efficient handling and manipulation of large genomic datasets.

To check which genome packages are available for different species, use:

library(BSgenome)
head(available.genomes( ))

## [1] "BSgenome.Alyrata.JGI.v1"                
## [2] "BSgenome.Amellifera.BeeBase.assembly4"  
## [3] "BSgenome.Amellifera.NCBI.AmelHAv3.1"    
## [4] "BSgenome.Amellifera.UCSC.apiMel2"       
## [5] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [6] "BSgenome.Aofficinalis.NCBI.V1"

The BSgenome package naming format follows this structure:

BSgenome.<species>.<source>.<major_version>

where:

<species> refers to the scientific name of the organism (e.g., Hsapiens for Homo sapiens)
<source> indicates the genome assembly provider (e.g., UCSC or NCBI)
<major_version> represents the genome version (e.g., hg38 for Homo sapiens GRCh38)

If, for example, we want to retrieve human hg19 genome from UCSC database, we need to type the following command in R:

BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")

library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19

Then, by typing the name of the BSgenome object in R, we can retrieve detailed information about the genome:

genome

## | BSgenome object for Human
## | - organism: Homo sapiens
## | - provider: UCSC
## | - genome: hg19
## | - release date: June 2013
## | - 298 sequence(s):
## |     chr1                  chr2                  chr3                 
## |     chr4                  chr5                  chr6                 
## |     chr7                  chr8                  chr9                 
## |     chr10                 chr11                 chr12                
## |     chr13                 chr14                 chr15                
## |     ...                   ...                   ...                  
## |     chr19_gl949749_alt    chr19_gl949750_alt    chr19_gl949751_alt   
## |     chr19_gl949752_alt    chr19_gl949753_alt    chr20_gl383577_alt   
## |     chr21_gl383578_alt    chr21_gl383579_alt    chr21_gl383580_alt   
## |     chr21_gl383581_alt    chr22_gl383582_alt    chr22_gl383583_alt   
## |     chr22_kb663609_alt                                               
## | 
## | Tips: call 'seqnames()' on the object to get all the sequence names, call
## | 'seqinfo()' to get the full sequence info, use the '$' or '[[' operator to
## | access a given sequence, see '?BSgenome' for more information.

This summary provides an overview of the genome and guidance on how to access its sequences.

We can use the following functions to explore the genome structure and retrieve specific sequences:

seqnames() – Retrieves all chromosome names

head(seqnames(genome))

## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"

seqlengths() – Extracts chromosome lengths

head(seqlengths(genome))

##      chr1      chr2      chr3      chr4      chr5      chr6 
## 249250621 243199373 198022430 191154276 180915260 171115067

$ and [[ ]] accessors – Retrieve a specific chromosome sequence.
The returned sequence is a DNAString object from the Biostrings package, which is optimized for storing and manipulating large sequences efficiently.

chr1_seq <- genome$chr1 
chr1_seq

## 249250621-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

chr1_seq_alt <- genome[["chr1"]]  
chr1_seq_alt

## 249250621-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

class(chr1_seq)

## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"

DNAString object

To subset a DNAString object, we have two main approaches:

A. Using [] (Bracket Indexing)

chr1 = genome$chr1
chr1[1000000:1000010]

## 11-letter DNAString object
## seq: TGGGCACAGCC

B. Using subseq() function: more explicit way to extract a subsequence

sequence <- subseq(chr1, start = 1000000, end = 1000010)
sequence

## 11-letter DNAString object
## seq: TGGGCACAGCC

Biostrings package

The Biostrings package provides a variety of functions for managing and analyzing genomic sequences efficiently.

Examples of useful functions in Biostrings:

Evaluating base frequency using alphabetFrequency()
Output: A table showing counts of A, C, G, T, and other characters.

alphabetFrequency(sequence)

## A C G T M R W S Y K V H D B N - + . 
## 2 4 4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Generating complement, reverse or reverse complement of a sequence with complement( ), reverse() and reverseComplement( )

complement(sequence)

## 11-letter DNAString object
## seq: ACCCGTGTCGG

reverse(sequence)

## 11-letter DNAString object
## seq: CCGACACGGGT

reverseComplement(sequence)

## 11-letter DNAString object
## seq: GGCTGTGCCCA

Translating a DNA sequence into amino acid sequence using translate( ).
If the sequence length is not a multiple of three, the remaining bases at the end are ignored.

translate(sequence)

## 3-letter AAString object
## seq: WAQ

Assessing pattern matching in DNA sequences:
- Find exact match of a specific motif within a sequence using matchPattern( )
- Count occurrence of a pattern by countPattern( )

matchPattern("CA", sequence)

## Views on a 11-letter DNAString subject
## subject: TGGGCACAGCC
## views:
##       start end width
##   [1]     5   6     2 [CA]
##   [2]     7   8     2 [CA]

countPattern("CA", sequence)

## [1] 2

By adjusting max.mismatch and min.mismatch, we can tolerate sequencing errors or mutations.
The default setting (max.mismatch = 0, min.mismatch = 0) enforces perfect matching.

matchPattern("CAC", sequence, max.mismatch=1)

## Views on a 11-letter DNAString subject
## subject: TGGGCACAGCC
## views:
##       start end width
##   [1]     5   7     3 [CAC]
##   [2]     7   9     3 [CAG]

countPattern("CAC", sequence, max.mismatch=1)

## [1] 2

Creating and reading a FASTA

To create a FASTA file from a sequence in Biostrings, follow these steps:

Convert a DNAString to a DNAStringSet
A DNAStringSet object is required to store multiple sequences in a FASTA format.

Set sequence names
Each FASTA sequence needs a header (identifier). We assign it using names()

Write the sequence to a FASTA file
Use writeXStringSet() to save the sequence to a FASTA file.

sequence = DNAStringSet(sequence)
names(sequence) <- "My sequence"
sequence

## DNAStringSet object of length 1:
##     width seq                                               names               
## [1]    11 TGGGCACAGCC                                       My sequence

writeStringSet(sequence, "My_sequence.fa")

To read a FASTA file back into R, use readDNAStringSet().

Genomic intervals (BED files)

The BED (Browser Extensible Data) file format is a widely used tab-delimited format for storing genomic regions.

It is commonly used for annotations and analysis results such as transcription start sites (TSS), ChIP-seq peaks, and other genomic features.

For a more comprehensive description, visit https://genome.ucsc.edu/FAQ/FAQformat.html##format1

Required fields:

Chromosome: name of the chromosome (e.g., "chr1")
Start Position: 0-based start coordinate (first base is 0)
End Position: 1-based end coordinate (last base is not included)

Optional (but often useful) fields:

Name: Identifier or description of the feature
Score: A number between 0 and 1000 (e.g., ChIP-seq peak intensity)
Strand: "+" (forward), "-" (reverse), or "." (unknown)

No header line is present in BED files.

BED coordinates are zero-based (start position) and half-open (end position is excluded).

This format is compact and efficient, making it ideal for large genomic datasets.

Example of a BED file

chr1 1000 1050 GeneX 900 + 
chr1 2000 2100 ChIP_peak1 750 - 
chr2 3000 3200 Enhancer 500 .

Manipulating genomic intervals

When working with genomic regions, two key Bioconductor packages are:

GenomicRanges: For creating, manipulating, and analyzing genomic intervals.
rtracklayer: For importing and exporting genomic data (e.g., BED, GFF, BigWig).

Creating a `GenomicRanges (GRanges) object` in R

To create a GRanges Object, we first need to define genomic intervals using the IRanges package.

Load required packages

library(GenomicRanges) 
library(IRanges)

Create an IRanges Object
The IRanges package defines intervals (start and end positions) without chromosome or strand information.

intervals = IRanges(start = c(2000, 900, 10000)
                    , end = c(2040, 950, 200000)
)
intervals

## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]      2000      2040        41
##   [2]       900       950        51
##   [3]     10000    200000    190001

Convert to GRanges Object
To create a GRanges Object, we must associate IRanges with chromosomes and strands.

GR = GRanges(seqnames = c("chr1","chr3","chr2")
             , ranges = intervals
             , strand = c("+", "-", "*")
)
GR

## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames       ranges strand
##          <Rle>    <IRanges>  <Rle>
##   [1]     chr1    2000-2040      +
##   [2]     chr3      900-950      -
##   [3]     chr2 10000-200000      *
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

Add metadata (optional)

We can attach gene names, scores, or other annotations to each region.

mcols(GR) <- DataFrame(gene = c("SF3B1", "SRSF7", "TPX2"),
                       score = c(900, 750, 500))
GR

## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames       ranges strand |        gene     score
##          <Rle>    <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr1    2000-2040      + |       SF3B1       900
##   [2]     chr3      900-950      - |       SRSF7       750
##   [3]     chr2 10000-200000      * |        TPX2       500
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

Retrieving information and modifying values in `GenomicRanges`

A GRanges Object can be manipulated like a vector, allowing us to extract, subset, and modify genomic intervals.

Extracting specific rows
To retrieve the first two rows from a GRanges Object, use indexing:

GR[1:2]

## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames    ranges strand |        gene     score
##          <Rle> <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr1 2000-2040      + |       SF3B1       900
##   [2]     chr3   900-950      - |       SRSF7       750
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

Extracting specific informations
We can retrieve individual components of a GRanges Object using functions: start( ), end( ), seqnames( ), strand(), width(), mcols()

seqnames(GR)  # Chromosome names

## factor-Rle of length 3 with 3 runs
##   Lengths:    1    1    1
##   Values : chr1 chr3 chr2
## Levels(3): chr1 chr3 chr2

start(GR)     # Start positions

## [1]  2000   900 10000

end(GR)       # End positions

## [1]   2040    950 200000

strand(GR)    # Strand information

## factor-Rle of length 3 with 3 runs
##   Lengths: 1 1 1
##   Values : + - *
## Levels(3): + - *

width(GR)     # Interval lengths

## [1]     41     51 190001

mcols(GR)     # Metadata (e.g., gene names, scores)

## DataFrame with 3 rows and 2 columns
##          gene     score
##   <character> <numeric>
## 1       SF3B1       900
## 2       SRSF7       750
## 3        TPX2       500

Modifying values in a GRanges Object
We can update values directly, just like a regular vector:

# Change the chromosome of the first entry
seqnames(GR)[1] <- "chr2"

# Modify the start position of the second interval
# Start must be always minor than End position
start(GR)[2] <- 910

# Change strand of the third interval
strand(GR)[3] <- "+"

# Update metadata (e.g., rename a gene)
mcols(GR)$gene[1] <- "MYC"

In GRanges, chromosome names (seqnames(gr)) are stored as factors.
This means you cannot directly assign a chromosome that is not already in the object.
Instead, you must add the new chromosome as a factor level first:

seqlevels(GR) # only chr1, chr2 and chr3 are defined

## [1] "chr1" "chr3" "chr2"

seqlevels(GR) <- c("chr1", "chr2", "chr3", "chr8", "chr9", "chr21") # Adding levels
seqnames(GR)[1:3] <- c("chr8", "chr9", "chr21")

Adding strand and names to a `GRanges Object`

When creating a GRanges object, we can assign strand information: +, -, *

GR2 = GRanges(seqnames = c("chr1","chr3","chr2")
              , ranges = intervals
              , strand = c("*", "+", "-")
)
GR2

## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames       ranges strand
##          <Rle>    <IRanges>  <Rle>
##   [1]     chr1    2000-2040      *
##   [2]     chr3      900-950      +
##   [3]     chr2 10000-200000      -
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

We can also modify strand values after creating the object:

strand(GR) = c("*", "+", "-")
GR

## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames       ranges strand |        gene     score
##          <Rle>    <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr8    2000-2040      * |         MYC       900
##   [2]     chr9      910-950      + |       SRSF7       750
##   [3]    chr21 10000-200000      - |        TPX2       500
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths

In GRanges, names are stored in metadata columns (e.g., mcols(gr)$name).

names(GR) = c("SFRB1", "FOXA1", "MYC")
GR

## GRanges object with 3 ranges and 2 metadata columns:
##         seqnames       ranges strand |        gene     score
##            <Rle>    <IRanges>  <Rle> | <character> <numeric>
##   SFRB1     chr8    2000-2040      * |         MYC       900
##   FOXA1     chr9      910-950      + |       SRSF7       750
##     MYC    chr21 10000-200000      - |        TPX2       500
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths

As with vectors, we can use names to index and retrieve genomic intervals of interest.

GR["SFRB1"]

## GRanges object with 1 range and 2 metadata columns:
##         seqnames    ranges strand |        gene     score
##            <Rle> <IRanges>  <Rle> | <character> <numeric>
##   SFRB1     chr8 2000-2040      * |         MYC       900
##   -------
##   seqinfo: 6 sequences from an unspecified genome; no seqlengths

Metadata columns

In GRanges, we can store multiple metadata columns (e.g., gene names, expression levels, scores) by specifying them directly in the GRanges() function.

GR3 = GRanges(seqnames = c("chr1","chr3","chr2")
              , ranges = intervals
              , strand = c("*", "+", "-")
              , score = c(10, 40, 60)
              , expression = c(2.5, 5.0, 3.2)
)
GR3

## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames       ranges strand |     score expression
##          <Rle>    <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1    2000-2040      * |        10        2.5
##   [2]     chr3      900-950      + |        40        5.0
##   [3]     chr2 10000-200000      - |        60        3.2
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

Using mcols( ) function we can view all metadata columns, while using $ accessor we can get values from a specific column

mcols(GR3)

## DataFrame with 3 rows and 2 columns
##       score expression
##   <numeric>  <numeric>
## 1        10        2.5
## 2        40        5.0
## 3        60        3.2

GR3$expression

## [1] 2.5 5.0 3.2

Converting between `GRanges` and `data.frame`

In Bioconductor, we often need to convert GRanges objects to data.frame (for easy manipulation) and vice versa (to restore genomic features).

Convert a GRanges Object to a data.frame : use the as.data.frame() function.
Now, the genomic intervals are stored as a standard data frame, which makes them easier to manipulate.

GR3 = as.data.frame(GR3)
GR3

##   seqnames start    end  width strand score expression
## 1     chr1  2000   2040     41      *    10        2.5
## 2     chr3   900    950     51      +    40        5.0
## 3     chr2 10000 200000 190001      -    60        3.2

Convert a data.frame back to GRanges
If we want to restore the genomic intervals, we must convert the data frame back into a GRanges object.
The makeGRangesFromDataFrame() function automatically converts a data.frame into a GRanges object, making sure that genomic coordinates and metadata are correctly mapped:

GR3 = makeGRangesFromDataFrame(GR3,
                               , keep.extra.columns = TRUE           # Keep metadata columns
                               , ignore.strand = FALSE               # Use strand information
                               , seqinfo = NULL                      # Optional: Provide Seqinfo object for chromosome lengths
                               , seqnames.field = c("seqnames")      # Column storing chromosome names
                               , start.field = "start"               # Names for the column in df that contains the start positions of the genomic ranges
                               , end.field = c("end")                # Names for the column in df that contains the end positions of the genomic ranges. 
                               , strand.field = "strand"             # Names for the column in df that contains the start positions of the genomic ranges. 
                               , starts.in.df.are.0based = FALSE     # Set to TRUE if 0-based coordinates 
)

GR3

## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames       ranges strand |     score expression
##          <Rle>    <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1    2000-2040      * |        10        2.5
##   [2]     chr3      900-950      + |        40        5.0
##   [3]     chr2 10000-200000      - |        60        3.2
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

Feature	GRanges	data.frame
Stores Genomic Ranges	Yes (`IRanges` for intervals)	No (uses numeric start/end)
Metadata Support	Yes `(mcols(gr))`	Yes (columns)
Strand Info	Yes `(strand(gr))`	No (stored as column)
Easier to Analyze in R	More specialized	Standard data manipulation

2025-03-18

Shifting genomic ranges in `GRanges`

The shift() function allows us to move genomic intervals by a specified number of bases without changing their length.
This can be useful for tasks like simulating mutations or adjusting genomic annotations.

GR3[1]

## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score expression
##          <Rle> <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1 2000-2040      * |        10        2.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

GenomicRanges::shift(GR3[1], shift = 40)

## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score expression
##          <Rle> <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1 2040-2080      * |        10        2.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

We can apply shift() to all genomic intervals at once:

GenomicRanges::shift(GR3, shift = 50)

## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames       ranges strand |     score expression
##          <Rle>    <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1    2050-2090      * |        10        2.5
##   [2]     chr3     950-1000      + |        40        5.0
##   [3]     chr2 10050-200050      - |        60        3.2
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

If needed, we can adjust shift direction based on strand:

shift(GR3[strand(GR3) == "+"], shift = 50)
shift(GR3[strand(GR3) == "-"], shift = -50)

Resizing genomic ranges

The resize() function allows us to change the width (length) of genomic intervals while specifying whether to fix the start, end, or center of the interval.

resize() is strand-aware, meaning:

On the “+” strand: start is the beginning of the interval
On the “-” strand: start refers to the rightmost coordinate (end).

GR3[1]

## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score expression
##          <Rle> <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1 2000-2040      * |        10        2.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

resize(GR3[1], width = 100, fix = "start")

## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score expression
##          <Rle> <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1 2000-2099      * |        10        2.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

resize(GR3[1], width = 100, fix = "end")

## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score expression
##          <Rle> <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1 1941-2040      * |        10        2.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

resize(GR3[1], width = 100, fix = "center")

## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     score expression
##          <Rle> <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr1 1970-2069      * |        10        2.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

GR3[3]

## GRanges object with 1 range and 2 metadata columns:
##       seqnames       ranges strand |     score expression
##          <Rle>    <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr2 10000-200000      - |        60        3.2
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

resize(GR3[3], width = 100,fix = "start")

## GRanges object with 1 range and 2 metadata columns:
##       seqnames        ranges strand |     score expression
##          <Rle>     <IRanges>  <Rle> | <numeric>  <numeric>
##   [1]     chr2 199901-200000      - |        60        3.2
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

`fix` Option	Positive Strand (+)	Negative Strand (-)	Unstranded (*)
“start”	Keeps start fixed, expands to the right	Keeps start fixed (which is actually the “end”), expands to the left	Keeps start fixed
“end”	Keeps end fixed, expands to the left	Keeps end fixed (which is actually the “start”), expands to the right	Keeps end fixed
“center”	Expands symmetrically around the midpoint	Expands symmetrically around the midpoint	Expands symmetrically

Merging overlapping genomic ranges

The reduce() function in GenomicRanges is used to merge overlapping genomic intervals while preserving chromosome and strand information.

If ignore.strand = TRUE, merges intervals across strands.

GR4 = GRanges(seqnames = c("chr1", "chr1", "chr1")
              , strand = c("+", "-", "-")
              , ranges = IRanges(start = c(1,8,20), end = c(24, 50, 100))
              )

# Merge overlapping regions (strand-aware)
GenomicRanges::reduce(GR4)

## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1      1-24      +
##   [2]     chr1     8-100      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# Merge ignoring strand
GenomicRanges::reduce(GR4, ignore.strand = TRUE)

## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1     1-100      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Finding overlapping genomic ranges

The findOverlaps() function in GenomicRanges helps identify which genomic intervals overlap between two GRanges Objects.
It returns a Hits object, which stores indices of overlapping elements from both datasets.

Specifically, we define:

Query: first GRanges Object
Subject: second GRanges Object

By applying findOverlaps() we get a Hits object that stores:

Query indices: Indices of overlapping elements in Query
Subject indices: Indices of overlapping elements in Subject

We can extract those indices using:

queryHits(overlap): Indices of overlapping elements in Query
subjectHits(overlap): Indices of overlapping elements in Subject

GR4 = GRanges(seqnames = c("chr1", "chr1", "chr1")
              , strand = c("+", "-", "-")
              , ranges = IRanges(start = c(1,8,20), end = c(24, 50, 100))
              )
GR5 = GRanges(seqnames = c("chr1", "chr1", "chr1")
              , strand = c("+", "+", "+")
              , ranges = IRanges(start = c(5,10,100), end = c(60, 20, 150))
              )

overlap = findOverlaps(GR4, GR5)
queryHits(overlap)

## [1] 1 1

subjectHits(overlap)

## [1] 1 2

These functions give us the indices of overlapping intervals, so we can use them to retrieve the overlapping ranges:

GR4[queryHits(overlap)]

## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1      1-24      +
##   [2]     chr1      1-24      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

GR5[subjectHits(overlap)]

## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1      5-60      +
##   [2]     chr1     10-20      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

If we want to ignore strand when finding overlaps, we use ignore.strand = TRUE:

findOverlaps(GR4, GR5, ignore.strand = TRUE)

This considers + and - strands as the same and finds more matches.

Importing and exporting in BED format

Genomic intervals are often stored in BED files.
The rtracklayer package provides functions to import BED files as GRanges objects and export GRanges objects as BED files:

import.bed("file.bed") : Reads a BED file and converts it into a GRanges object
export.bed(gr_object, "file.bed"): Saves a GRanges object as a BED file

Combining GenomicRanges and BSgenome to extract sequences

We have learned how to extract specific genomic sequences using:

subseq() or []: Extracting sub-sequences from a DNAString object
BSgenome + GenomicRanges: Extracting sequences across the entire genome

With GenomicRanges, we can retrieve sequences for multiple genomic intervals at once from a BSgenome reference genome.

For example, if we want to extract sequences from hg19 human genome:

library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
library(Biostrings)

# Define genomic intervals in a GRanges object
GR6 = GRanges(seqnames = c("chr1", "chr1", "chr1")
              , strand = c("+", "-", "-")
              , ranges = IRanges(start = c(100000,800000,2000000), end = c(100020,800030,2000040))
              )

# Extract sequences from the reference genome
sequences = getSeq(BSgenome.Hsapiens.UCSC.hg19, GR6)
sequences

## DNAStringSet object of length 3:
##     width seq
## [1]    21 CACTAAGCACACAGAGAATAA
## [2]    31 TTAAAAGCTTAAAGCAAACAGAAAGACATGC
## [3]    41 GCGTGACCCTCCTTTGGAAAAGGGTTCCTGCACGTGTGATT

The result is a DNAStringSet object containing sequences for each genomic interval.

We can export the extracted sequences in FASTA format using Biostrings::writeXStringSet().

# Assign sequence names
names(sequences) <- c("region1", "region2", "region3")

# Save sequences to a FASTA file
writeXStringSet(sequences, "extracted_sequences.fasta")

A brief recap

The central dogma of biology

The cell is a highly complex system, where essential biological processes occur, including:

regulation of gene expression
transport of molecules
synthesis of biomolecules
conversion of chemical energy

Gene expression involves a sequence of molecular transformations: DNA → RNA → Protein

In 1956, Francis Crick described this directional flow of genetic information as the “central dogma” of molecular biology.

This concept forms the foundation of molecular genetics, explaining how genetic instructions in DNA are transcribed into RNA and translated into proteins.

Alberts et al. 2002, the pathway from DNA to protein

The structure of a transcript

Once a precursor messenger RNA (pre-mRNA) is synthesized, it undergoes several modifications before being transported out of the nucleus for translation.

A key step in mRNA maturation is splicing, a process where introns (non-coding sequences) are removed whereas exons (coding sequences) are joined together.
The final product is a mature messenger RNA (mRNA), which can be translated into a functional protein.

More than 95% of human genes are transcribed into multiple isoforms (mature transcripts) that encode for different proteins, thus expanding protein diversity. This mechanism is called alternative splicing.

Chimica online, alternative splicing

Key takeaways:

Exons are the building blocks of transcripts: They contain the coding sequences that determine protein structure.
95% of genes produce multiple transcripts: Through alternative splicing, a single gene can generate different mRNA isoforms, leading to greater protein diversity.

Genes as genomic features

Genomic features, such as genes, transcripts, and exons, are stored in GTF (Gene Transfer Format) or GFF (General Feature Format) files.
These files allow us to reconstruct gene models, which represent the structure and organization of genes within the genome.

Gene models can be accessed using the TxDb (Transcript Database) packages in Bioconductor.
The standard format for TxDb packages follows this structure: TxDb.<species>.<source>.<major version>.<table>

BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")

If you have a GTF file stored locally, you can import it and convert it into a TxDb object using the makeTxDbFromGFF() function from the GenomicFeatures package:

library(GenomicFeatures)
txdb <- makeTxDbFromGFF(file = "GENCODE_v44.gtf")

This allows you to analyze custom gene annotations or work with specific genome versions.

TxDb Object: Exploring Gene Models

A TxDb object stores structured genomic annotations, including gene, transcript, exon, and CDS (coding sequence) information.

Typing the name of a TxDb object provides an overview of the gene model it contains:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TxDb.Hsapiens.UCSC.hg19.knownGene

## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1

This displays key metadata such as:

The genome version (e.g., hg19)
The data source (e.g., UCSC, Ensembl)
The number of chromosomes, genes, transcripts, and exons

Interrogating a `TxDb object`

A TxDb object allows us to extract various genomic features such as genes, transcripts, exons, coding sequences (CDS), and promoters.

The output of these queries is typically a GRanges Object, which makes it easy to manipulate genomic intervals.

Feature	Function Call
Genes	`genes(txdb)`
Transcripts	`transcripts(txdb)`
Exons	`exons(txdb)`
CDS (Coding Sequences)	`cds(txdb)`
Promoters	`promoters(txdb, upstream, downstream)`

We can retrieve transcript IDs and corresponding gene IDs by specifying additional columns in the query:

transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene, columns = c("gene_id","tx_id"))

## GRanges object with 82960 ranges and 2 metadata columns:
##                 seqnames        ranges strand |         gene_id     tx_id
##                    <Rle>     <IRanges>  <Rle> | <CharacterList> <integer>
##       [1]           chr1   11874-14409      + |       100287102         1
##       [2]           chr1   11874-14409      + |       100287102         2
##       [3]           chr1   11874-14409      + |       100287102         3
##       [4]           chr1   69091-70008      + |           79501         4
##       [5]           chr1 321084-321115      + |                         5
##       ...            ...           ...    ... .             ...       ...
##   [82956] chrUn_gl000237        1-2686      - |                     82956
##   [82957] chrUn_gl000241   20433-36875      - |                     82957
##   [82958] chrUn_gl000243   11501-11530      + |                     82958
##   [82959] chrUn_gl000243   13608-13637      + |                     82959
##   [82960] chrUn_gl000247     5787-5816      - |                     82960
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

The promoters() function identifies regions surrounding the transcription start site (TSS). It requires specifying: - Upstream (e.g., 1000 bases) - Downstream (e.g., 100 bases)

promoters(TxDb.Hsapiens.UCSC.hg19.knownGene
          , upstream = 1000
          , downstream = 100)

## GRanges object with 82960 ranges and 2 metadata columns:
##                    seqnames        ranges strand |     tx_id     tx_name
##                       <Rle>     <IRanges>  <Rle> | <integer> <character>
##   uc001aaa.3           chr1   10874-11973      + |         1  uc001aaa.3
##   uc010nxq.1           chr1   10874-11973      + |         2  uc010nxq.1
##   uc010nxr.1           chr1   10874-11973      + |         3  uc010nxr.1
##   uc001aal.1           chr1   68091-69190      + |         4  uc001aal.1
##   uc001aaq.2           chr1 320084-321183      + |         5  uc001aaq.2
##          ...            ...           ...    ... .       ...         ...
##   uc011mgu.1 chrUn_gl000237     2587-3686      - |     82956  uc011mgu.1
##   uc011mgv.2 chrUn_gl000241   36776-37875      - |     82957  uc011mgv.2
##   uc011mgw.1 chrUn_gl000243   10501-11600      + |     82958  uc011mgw.1
##   uc022brq.1 chrUn_gl000243   12608-13707      + |     82959  uc022brq.1
##   uc022brr.1 chrUn_gl000247     5717-6816      - |     82960  uc022brr.1
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

The transcriptLengths() function computes the sum of exon lengths in each transcript (not including introns).

head(transcriptLengths(TxDb.Hsapiens.UCSC.hg19.knownGene))

##   tx_id    tx_name   gene_id nexon tx_len
## 1     1 uc001aaa.3 100287102     3   1652
## 2     2 uc010nxq.1 100287102     3   1488
## 3     3 uc010nxr.1 100287102     3   1595
## 4     4 uc001aal.1     79501     1    918
## 5     5 uc001aaq.2      <NA>     1     32
## 6     6 uc001aar.2      <NA>     1     62

Note: This function measures the processed transcript length (sum of exonic regions), not the full genomic region transcribed into RNA. To get the total transcribed DNA length, use:

head(width(transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)))

## [1] 2536 2536 2536  918   32   62

Extracting information

The select() function allows us to retrieve specific annotations from a TxDb object, such as transcript names, gene IDs, or exon details.
In particular, we can:

Check available data fields
- Use columns() to list available information (e.g., transcript names, gene lengths)
- Use keytypes() to see which column types can be used as keys (e.g., GENEID, TXID)

Define your key of interest: create a vector of gene IDs or transcript IDs for querying.

Use select() to extract data

For example:

# Step 1: Define gene IDs of interest  
keys <- c("100033416", "100033417", "100033420")  

# Step 2: Check available columns and key types  
columns(TxDb.Hsapiens.UCSC.hg19.knownGene)  # Lists all available metadata fields

##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"  
##  [6] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"  
## [11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"   
## [16] "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"  
## [21] "TXTYPE"

keytypes(TxDb.Hsapiens.UCSC.hg19.knownGene) # Lists valid key types

## [1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"

# Step 3: Retrieve transcript names for selected gene IDs  
select(TxDb.Hsapiens.UCSC.hg19.knownGene, keys = keys, columns = "TXNAME", keytype = "GENEID")

##      GENEID     TXNAME
## 1 100033416 uc001yxl.4
## 2 100033417 uc001yxo.3
## 3 100033420 uc001yxr.3

Combining GenomicFeatures and BSgenome for Transcript Sequences

We can extract transcript sequences by combining a TxDb object (which contains gene and transcripts annotations) with a BSgenome object (which stores genome sequences).

The function extractTranscriptSeqs() allows us to extract full transcript sequences using a reference genome.

extractTranscriptSeqs(x = BSgenome.Hsapiens.UCSC.hg19, transcripts = TxDb.Hsapiens.UCSC.hg19.knownGene)

## DNAStringSet object of length 82960:
##          width seq                                          names               
##     [1]   1652 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 1
##     [2]   1488 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 2
##     [3]   1595 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 3
##     [4]    918 ATGGTGACTGAATTCATTTTT...ATTCTAGTGTAAAGTTTTAG 4
##     [5]     32 TACAGACCAAGCTCATGACTCACAATGGCCTA             5
##     ...    ... ...
## [82956]   1217 GCCAGTTTAGGGTCTCTGGTA...AACCTCCGCCTCCTGAGATC 82956
## [82957]    737 CTCCACTTCTGATCCTCCCCG...CCGCTTTATTAGATGCAGTG 82957
## [82958]     30 TGGTGAATTTCAGCCAAAGTGGCCAAAGAA               82958
## [82959]     30 TTGCAGAGGTGGCTGGTTGCTCTTTGAGCC               82959
## [82960]     30 TGGTGAATTTCAGCCAAAGTGGCCAAAGAA               82960

Key applications:

Analyze transcript sequences from genome data
Compare transcript isoforms
Extract sequences for further downstream analysis (e.g., translation, motif search)

Extracting features as a `GRangesList`

If we need to group genomic features, we can use the following functions, which return a GRangesList:

transcriptsBy(): Groups transcripts by gene or other features
exonsBy(): Groups exons by transcript or gene
cdsBy(): Groups CDS (coding sequences) by transcript or gene

These will return a GRangesList where each element contains the exons of a specific transcript.

exonsBy(x = TxDb.Hsapiens.UCSC.hg19.knownGene, by = "tx")[1:2]

## GRangesList object of length 2:
## $`1`
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames      ranges strand |   exon_id   exon_name exon_rank
##          <Rle>   <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 11874-12227      + |         1        <NA>         1
##   [2]     chr1 12613-12721      + |         3        <NA>         2
##   [3]     chr1 13221-14409      + |         5        <NA>         3
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $`2`
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames      ranges strand |   exon_id   exon_name exon_rank
##          <Rle>   <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 11874-12227      + |         1        <NA>         1
##   [2]     chr1 12595-12721      + |         2        <NA>         2
##   [3]     chr1 13403-14409      + |         6        <NA>         3
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

We can change the by parameter to group differently:

By transcript = tx
By gene = gene

Export a `TxDb Object` as a GTF/GFF file

If we need to export a TxDb object (gene annotation data) into a GTF or GFF file, we can use the export.gff() function from the rtracklayer package.

export.gff(TxDb.Hsapiens.UCSC.hg19.knownGene, con = "file.gtf", format ="gtf")
export.gff(TxDb.Hsapiens.UCSC.hg19.knownGene, con = "file.gff", format ="gff")

GTF (Gene Transfer Format) is a widely used annotation format in gene expression analysis.
GFF (General Feature Format) is a more general format used in genomic annotations.

Bioconductor

2025-03-17

Bioconductor

Bioconductor webpage

Bioconductor structure

Installing Bioconductor and its packages

Genomic data types

Genomic sequences

BSgenome package

DNAString object

Biostrings package

Creating and reading a FASTA

Genomic intervals (BED files)

Example of a BED file

Manipulating genomic intervals

Creating a GenomicRanges (GRanges) object in R

Retrieving information and modifying values in GenomicRanges

Adding strand and names to a GRanges Object

Metadata columns

Converting between GRanges and data.frame

2025-03-18

Shifting genomic ranges in GRanges

Resizing genomic ranges

Merging overlapping genomic ranges

Finding overlapping genomic ranges

Importing and exporting in BED format

Combining GenomicRanges and BSgenome to extract sequences

A brief recap

The central dogma of biology

The structure of a transcript

Genes as genomic features

TxDb Object: Exploring Gene Models

Interrogating a TxDb object

Extracting information

Combining GenomicFeatures and BSgenome for Transcript Sequences

Extracting features as a GRangesList

Export a TxDb Object as a GTF/GFF file

Creating a `GenomicRanges (GRanges) object` in R

Retrieving information and modifying values in `GenomicRanges`

Adding strand and names to a `GRanges Object`

Converting between `GRanges` and `data.frame`

Shifting genomic ranges in `GRanges`

Interrogating a `TxDb object`

Extracting features as a `GRangesList`

Export a `TxDb Object` as a GTF/GFF file