Bioconductor is
an open-source software project.
It is community-driven and provides a framework of tools and databases
for the analysis of biological data in R.
In a nutshell:
Its main goals are:
Providing access to statistical and graphical methods for genomic data analysis
Facilitating the integration of metadata into analyses, e.g., literature data from PubMed
Offering a platform for developing new curated packages
Training researchers in computational and statistical methods for biological data analysis
All packages undergo review before release
Follows an open-source and open-development model
Bioconductor packages fall into four main categories:
Software: Tools for data analysis and visualization
Annotation: Resources for integrating biological metadata
Experimental data: Datasets for testing and benchmarking methods
Workflows: Predefined pipelines for common analysis tasks
Each package includes a vignette that explains its functionalities and provides usage examples.
To install Bioconductor, simply run in R console:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.18")
To install Bioconductor packages, use the following command:
BiocManager::install("package_name")
Nevertheless, each Bioconductor package includes a description of the installation command in R, which you can simply copy and paste.
You can also explore available packages by typing:
BiocManager::available( )
Bioconductor enables the manipulation of various biological data types.
The most common data types include:
Genomic sequences (FASTA)
Sequencing files (FASTQ)
Genomic intervals (BED files)
Genomic features (GFF/GTF)
Alignment data (SAM and BAM)
In this course, we will focus on handling genomic sequences and intervals, as well as retrieving genomic features.
The FASTA format is a text-based format used to represent nucleotide sequences (DNA/RNA) or amino acid sequences (proteins). In this format, nucleotides or amino acids are represented using single-letter codes, and each sequence entry can be preceded by a description or comment.
FASTA files consist of alternating lines of two types:
Header line: description of the sequence,
preceded by “>
” (e.g.,
>Sequence_ID Description
)
Sequence data: the actual nucleotide or amino acid sequence
Full genome sequences for many model organisms are
available in the BSgenome
package, providing easy access to
reference genomes for analysis.
Genomic sequences can be processed in Bioconductor using functions
from the Biostrings
package.
To retrieve the genome sequence of a specific species, we use the
BSgenome
package.
BSgenome
is a Biostrings-based genome data
package because the sequences it contains are stored in
fundamental data structures defined within the
Biostrings
package.
This ensures efficient handling and manipulation of large genomic
datasets.
To check which genome packages are available for different species, use:
library(BSgenome)
head(available.genomes( ))
## [1] "BSgenome.Alyrata.JGI.v1"
## [2] "BSgenome.Amellifera.BeeBase.assembly4"
## [3] "BSgenome.Amellifera.NCBI.AmelHAv3.1"
## [4] "BSgenome.Amellifera.UCSC.apiMel2"
## [5] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [6] "BSgenome.Aofficinalis.NCBI.V1"
The BSgenome
package naming format follows this
structure:
BSgenome.<species>.<source>.<major_version>
where:
<species>
refers to the scientific name of the
organism (e.g., Hsapiens for Homo sapiens)
<source>
indicates the genome assembly
provider (e.g., UCSC
or NCBI
)
<major_version>
represents the genome version
(e.g., hg38
for Homo sapiens GRCh38)
If, for example, we want to retrieve human hg19 genome from UCSC database, we need to type the following command in R:
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
Then, by typing the name of the BSgenome object
in R, we
can retrieve detailed information about the genome:
genome
## | BSgenome object for Human
## | - organism: Homo sapiens
## | - provider: UCSC
## | - genome: hg19
## | - release date: June 2013
## | - 298 sequence(s):
## | chr1 chr2 chr3
## | chr4 chr5 chr6
## | chr7 chr8 chr9
## | chr10 chr11 chr12
## | chr13 chr14 chr15
## | ... ... ...
## | chr19_gl949749_alt chr19_gl949750_alt chr19_gl949751_alt
## | chr19_gl949752_alt chr19_gl949753_alt chr20_gl383577_alt
## | chr21_gl383578_alt chr21_gl383579_alt chr21_gl383580_alt
## | chr21_gl383581_alt chr22_gl383582_alt chr22_gl383583_alt
## | chr22_kb663609_alt
## |
## | Tips: call 'seqnames()' on the object to get all the sequence names, call
## | 'seqinfo()' to get the full sequence info, use the '$' or '[[' operator to
## | access a given sequence, see '?BSgenome' for more information.
This summary provides an overview of the genome and guidance on how to access its sequences.
We can use the following functions to explore the genome structure and retrieve specific sequences:
seqnames()
– Retrieves all chromosome nameshead(seqnames(genome))
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
seqlengths()
– Extracts chromosome lengthshead(seqlengths(genome))
## chr1 chr2 chr3 chr4 chr5 chr6
## 249250621 243199373 198022430 191154276 180915260 171115067
$
and [[ ]]
accessors – Retrieve a
specific chromosome sequence.chr1_seq <- genome$chr1
chr1_seq
## 249250621-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
chr1_seq_alt <- genome[["chr1"]]
chr1_seq_alt
## 249250621-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
class(chr1_seq)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
To subset a DNAString
object, we have two main approaches:
A. Using []
(Bracket Indexing)
chr1 = genome$chr1
chr1[1000000:1000010]
## 11-letter DNAString object
## seq: TGGGCACAGCC
B. Using subseq()
function: more explicit way to extract
a subsequence
sequence <- subseq(chr1, start = 1000000, end = 1000010)
sequence
## 11-letter DNAString object
## seq: TGGGCACAGCC
The Biostrings
package provides a variety of functions
for managing and analyzing genomic sequences efficiently.
Examples of useful functions in Biostrings
:
alphabetFrequency()
A
,
C
, G
, T
, and other
characters.alphabetFrequency(sequence)
## A C G T M R W S Y K V H D B N - + .
## 2 4 4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
complement( )
,
reverse()
and reverseComplement( )
complement(sequence)
## 11-letter DNAString object
## seq: ACCCGTGTCGG
reverse(sequence)
## 11-letter DNAString object
## seq: CCGACACGGGT
reverseComplement(sequence)
## 11-letter DNAString object
## seq: GGCTGTGCCCA
translate( )
.translate(sequence)
## 3-letter AAString object
## seq: WAQ
matchPattern( )
countPattern( )
matchPattern("CA", sequence)
## Views on a 11-letter DNAString subject
## subject: TGGGCACAGCC
## views:
## start end width
## [1] 5 6 2 [CA]
## [2] 7 8 2 [CA]
countPattern("CA", sequence)
## [1] 2
By adjusting max.mismatch
and min.mismatch
,
we can tolerate sequencing errors or mutations.
The default setting (max.mismatch = 0, min.mismatch = 0
)
enforces perfect matching.
matchPattern("CAC", sequence, max.mismatch=1)
## Views on a 11-letter DNAString subject
## subject: TGGGCACAGCC
## views:
## start end width
## [1] 5 7 3 [CAC]
## [2] 7 9 3 [CAG]
countPattern("CAC", sequence, max.mismatch=1)
## [1] 2
To create a FASTA file from a sequence in Biostrings
,
follow these steps:
DNAStringSet
object is required to store multiple
sequences in a FASTA format.names()
writeXStringSet()
to save the sequence to a
FASTA file.sequence = DNAStringSet(sequence)
names(sequence) <- "My sequence"
sequence
## DNAStringSet object of length 1:
## width seq names
## [1] 11 TGGGCACAGCC My sequence
writeStringSet(sequence, "My_sequence.fa")
To read a FASTA file back into R, use
readDNAStringSet()
.
The BED (Browser Extensible Data) file format is a widely used tab-delimited format for storing genomic regions.
It is commonly used for annotations and analysis results such as transcription start sites (TSS), ChIP-seq peaks, and other genomic features.
For a more comprehensive description, visit https://genome.ucsc.edu/FAQ/FAQformat.html##format1
Required fields:
Chromosome: name of the chromosome (e.g.,
"chr1"
)
Start Position: 0-based start coordinate (first base is 0)
End Position: 1-based end coordinate (last base is not included)
Optional (but often useful) fields:
Name: Identifier or description of the feature
Score: A number between 0 and 1000 (e.g., ChIP-seq peak intensity)
Strand: "+"
(forward),
"-"
(reverse), or "."
(unknown)
No header line is present in BED files.
BED coordinates are zero-based (start position) and half-open (end position is excluded).
This format is compact and efficient, making it ideal for large genomic datasets.
chr1 1000 1050 GeneX 900 +
chr1 2000 2100 ChIP_peak1 750 -
chr2 3000 3200 Enhancer 500 .
When working with genomic regions, two key Bioconductor packages are:
GenomicRanges
: For creating, manipulating, and
analyzing genomic intervals.
rtracklayer
: For importing and exporting genomic
data (e.g., BED, GFF, BigWig).
GenomicRanges (GRanges) object
in RTo create a GRanges Object
, we first need to
define genomic intervals using the IRanges
package.
library(GenomicRanges)
library(IRanges)
IRanges Object
IRanges
package defines intervals
(start and end positions) without chromosome or strand information.intervals = IRanges(start = c(2000, 900, 10000)
, end = c(2040, 950, 200000)
)
intervals
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 2000 2040 41
## [2] 900 950 51
## [3] 10000 200000 190001
GRanges Object
GRanges Object
, we must associate
IRanges
with chromosomes and strands.GR = GRanges(seqnames = c("chr1","chr3","chr2")
, ranges = intervals
, strand = c("+", "-", "*")
)
GR
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 2000-2040 +
## [2] chr3 900-950 -
## [3] chr2 10000-200000 *
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
We can attach gene names, scores, or other annotations to each region.
mcols(GR) <- DataFrame(gene = c("SF3B1", "SRSF7", "TPX2"),
score = c(900, 750, 500))
GR
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | gene score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 2000-2040 + | SF3B1 900
## [2] chr3 900-950 - | SRSF7 750
## [3] chr2 10000-200000 * | TPX2 500
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
GenomicRanges
A GRanges Object
can be manipulated like a vector,
allowing us to extract, subset, and modify genomic
intervals.
GRanges Object
, use
indexing:GR[1:2]
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | gene score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 2000-2040 + | SF3B1 900
## [2] chr3 900-950 - | SRSF7 750
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
GRanges Object
using functions: start( )
, end( )
,
seqnames( )
, strand()
, width()
,
mcols()
seqnames(GR) # Chromosome names
## factor-Rle of length 3 with 3 runs
## Lengths: 1 1 1
## Values : chr1 chr3 chr2
## Levels(3): chr1 chr3 chr2
start(GR) # Start positions
## [1] 2000 900 10000
end(GR) # End positions
## [1] 2040 950 200000
strand(GR) # Strand information
## factor-Rle of length 3 with 3 runs
## Lengths: 1 1 1
## Values : + - *
## Levels(3): + - *
width(GR) # Interval lengths
## [1] 41 51 190001
mcols(GR) # Metadata (e.g., gene names, scores)
## DataFrame with 3 rows and 2 columns
## gene score
## <character> <numeric>
## 1 SF3B1 900
## 2 SRSF7 750
## 3 TPX2 500
GRanges Object
# Change the chromosome of the first entry
seqnames(GR)[1] <- "chr2"
# Modify the start position of the second interval
# Start must be always minor than End position
start(GR)[2] <- 910
# Change strand of the third interval
strand(GR)[3] <- "+"
# Update metadata (e.g., rename a gene)
mcols(GR)$gene[1] <- "MYC"
In GRanges
, chromosome names (seqnames(gr)
)
are stored as factors.
This means you cannot directly assign a chromosome that is not already
in the object.
Instead, you must add the new chromosome as a factor level first:
seqlevels(GR) # only chr1, chr2 and chr3 are defined
## [1] "chr1" "chr3" "chr2"
seqlevels(GR) <- c("chr1", "chr2", "chr3", "chr8", "chr9", "chr21") # Adding levels
seqnames(GR)[1:3] <- c("chr8", "chr9", "chr21")
GRanges Object
When creating a GRanges object
, we can assign
strand information: +
, -
,
*
GR2 = GRanges(seqnames = c("chr1","chr3","chr2")
, ranges = intervals
, strand = c("*", "+", "-")
)
GR2
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 2000-2040 *
## [2] chr3 900-950 +
## [3] chr2 10000-200000 -
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
We can also modify strand values after creating the object:
strand(GR) = c("*", "+", "-")
GR
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | gene score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr8 2000-2040 * | MYC 900
## [2] chr9 910-950 + | SRSF7 750
## [3] chr21 10000-200000 - | TPX2 500
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
In GRanges
, names are stored in metadata
columns (e.g., mcols(gr)$name
).
names(GR) = c("SFRB1", "FOXA1", "MYC")
GR
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | gene score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## SFRB1 chr8 2000-2040 * | MYC 900
## FOXA1 chr9 910-950 + | SRSF7 750
## MYC chr21 10000-200000 - | TPX2 500
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
As with vectors, we can use names to index and retrieve genomic intervals of interest.
GR["SFRB1"]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | gene score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## SFRB1 chr8 2000-2040 * | MYC 900
## -------
## seqinfo: 6 sequences from an unspecified genome; no seqlengths
In GRanges
, we can store multiple metadata
columns (e.g., gene names, expression levels, scores) by
specifying them directly in the GRanges()
function.
GR3 = GRanges(seqnames = c("chr1","chr3","chr2")
, ranges = intervals
, strand = c("*", "+", "-")
, score = c(10, 40, 60)
, expression = c(2.5, 5.0, 3.2)
)
GR3
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | score expression
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 2000-2040 * | 10 2.5
## [2] chr3 900-950 + | 40 5.0
## [3] chr2 10000-200000 - | 60 3.2
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
Using mcols( )
function we can view all metadata
columns, while using $
accessor we can get values from a
specific column
mcols(GR3)
## DataFrame with 3 rows and 2 columns
## score expression
## <numeric> <numeric>
## 1 10 2.5
## 2 40 5.0
## 3 60 3.2
GR3$expression
## [1] 2.5 5.0 3.2
GRanges
and
data.frame
In Bioconductor
, we often need to
convert GRanges objects
to
data.frame
(for easy manipulation) and
vice versa (to restore genomic features).
GRanges Object
to a data.frame
:
use the as.data.frame()
function.GR3 = as.data.frame(GR3)
GR3
## seqnames start end width strand score expression
## 1 chr1 2000 2040 41 * 10 2.5
## 2 chr3 900 950 51 + 40 5.0
## 3 chr2 10000 200000 190001 - 60 3.2
data.frame
back to GRanges
GRanges
object.makeGRangesFromDataFrame()
function automatically
converts a data.frame
into a GRanges object
,
making sure that genomic coordinates and metadata are correctly
mapped:GR3 = makeGRangesFromDataFrame(GR3,
, keep.extra.columns = TRUE # Keep metadata columns
, ignore.strand = FALSE # Use strand information
, seqinfo = NULL # Optional: Provide Seqinfo object for chromosome lengths
, seqnames.field = c("seqnames") # Column storing chromosome names
, start.field = "start" # Names for the column in df that contains the start positions of the genomic ranges
, end.field = c("end") # Names for the column in df that contains the end positions of the genomic ranges.
, strand.field = "strand" # Names for the column in df that contains the start positions of the genomic ranges.
, starts.in.df.are.0based = FALSE # Set to TRUE if 0-based coordinates
)
GR3
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | score expression
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 2000-2040 * | 10 2.5
## [2] chr3 900-950 + | 40 5.0
## [3] chr2 10000-200000 - | 60 3.2
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
Feature | GRanges | data.frame |
---|---|---|
Stores Genomic Ranges | Yes (IRanges for intervals) |
No (uses numeric start/end) |
Metadata Support | Yes (mcols(gr)) |
Yes (columns) |
Strand Info | Yes (strand(gr)) |
No (stored as column) |
Easier to Analyze in R | More specialized | Standard data manipulation |
GRanges
The shift()
function allows us to move genomic
intervals by a specified number of bases without changing their
length.
This can be useful for tasks like simulating mutations or adjusting
genomic annotations.
GR3[1]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | score expression
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 2000-2040 * | 10 2.5
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
GenomicRanges::shift(GR3[1], shift = 40)
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | score expression
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 2040-2080 * | 10 2.5
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
We can apply shift()
to all genomic intervals at
once:
GenomicRanges::shift(GR3, shift = 50)
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | score expression
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 2050-2090 * | 10 2.5
## [2] chr3 950-1000 + | 40 5.0
## [3] chr2 10050-200050 - | 60 3.2
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
If needed, we can adjust shift direction based on strand:
shift(GR3[strand(GR3) == "+"], shift = 50)
shift(GR3[strand(GR3) == "-"], shift = -50)
The resize()
function allows us to change the width
(length) of genomic intervals while specifying whether to fix the start,
end, or center of the interval.
resize()
is strand-aware, meaning:
start
is the beginning of the
intervalstart
refers to the rightmost
coordinate (end).GR3[1]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | score expression
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 2000-2040 * | 10 2.5
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
resize(GR3[1], width = 100, fix = "start")
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | score expression
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 2000-2099 * | 10 2.5
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
resize(GR3[1], width = 100, fix = "end")
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | score expression
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 1941-2040 * | 10 2.5
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
resize(GR3[1], width = 100, fix = "center")
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | score expression
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 1970-2069 * | 10 2.5
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
GR3[3]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | score expression
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr2 10000-200000 - | 60 3.2
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
resize(GR3[3], width = 100,fix = "start")
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | score expression
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr2 199901-200000 - | 60 3.2
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
fix Option |
Positive Strand (+) | Negative Strand (-) | Unstranded (*) |
---|---|---|---|
“start” | Keeps start fixed, expands to the right | Keeps start fixed (which is actually the “end”), expands to the left | Keeps start fixed |
“end” | Keeps end fixed, expands to the left | Keeps end fixed (which is actually the “start”), expands to the right | Keeps end fixed |
“center” | Expands symmetrically around the midpoint | Expands symmetrically around the midpoint | Expands symmetrically |
The reduce()
function in GenomicRanges
is
used to merge overlapping genomic intervals while preserving chromosome
and strand information.
If ignore.strand = TRUE
, merges intervals across
strands.
GR4 = GRanges(seqnames = c("chr1", "chr1", "chr1")
, strand = c("+", "-", "-")
, ranges = IRanges(start = c(1,8,20), end = c(24, 50, 100))
)
# Merge overlapping regions (strand-aware)
GenomicRanges::reduce(GR4)
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 1-24 +
## [2] chr1 8-100 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Merge ignoring strand
GenomicRanges::reduce(GR4, ignore.strand = TRUE)
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 1-100 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The findOverlaps()
function in
GenomicRanges
helps identify which genomic intervals
overlap between two GRanges Objects
.
It returns a Hits object
, which stores indices of
overlapping elements from both datasets.
Specifically, we define:
GRanges Object
GRanges Object
By applying findOverlaps()
we get a Hits
object that stores:
Query indices: Indices of overlapping elements in
Query
Subject indices: Indices of overlapping elements in
Subject
We can extract those indices using:
queryHits(overlap)
: Indices of overlapping elements
in Query
subjectHits(overlap)
: Indices of overlapping
elements in Subject
GR4 = GRanges(seqnames = c("chr1", "chr1", "chr1")
, strand = c("+", "-", "-")
, ranges = IRanges(start = c(1,8,20), end = c(24, 50, 100))
)
GR5 = GRanges(seqnames = c("chr1", "chr1", "chr1")
, strand = c("+", "+", "+")
, ranges = IRanges(start = c(5,10,100), end = c(60, 20, 150))
)
overlap = findOverlaps(GR4, GR5)
queryHits(overlap)
## [1] 1 1
subjectHits(overlap)
## [1] 1 2
These functions give us the indices of overlapping intervals, so we can use them to retrieve the overlapping ranges:
GR4[queryHits(overlap)]
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 1-24 +
## [2] chr1 1-24 +
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
GR5[subjectHits(overlap)]
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 5-60 +
## [2] chr1 10-20 +
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If we want to ignore strand when finding overlaps, we use
ignore.strand = TRUE
:
findOverlaps(GR4, GR5, ignore.strand = TRUE)
This considers +
and -
strands as the same
and finds more matches.
Genomic intervals are often stored in BED files.
The rtracklayer
package provides functions to import BED
files as GRanges objects
and export
GRanges objects
as BED files:
import.bed("file.bed")
: Reads a BED file and converts
it into a GRanges object
export.bed(gr_object, "file.bed")
: Saves a
GRanges object
as a BED file
We have learned how to extract specific genomic sequences using:
subseq()
or []
: Extracting
sub-sequences from a DNAString object
BSgenome
+ GenomicRanges
: Extracting
sequences across the entire genome
With GenomicRanges
, we can retrieve sequences for
multiple genomic intervals at once from a BSgenome
reference genome.
For example, if we want to extract sequences from hg19 human genome:
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
library(Biostrings)
# Define genomic intervals in a GRanges object
GR6 = GRanges(seqnames = c("chr1", "chr1", "chr1")
, strand = c("+", "-", "-")
, ranges = IRanges(start = c(100000,800000,2000000), end = c(100020,800030,2000040))
)
# Extract sequences from the reference genome
sequences = getSeq(BSgenome.Hsapiens.UCSC.hg19, GR6)
sequences
## DNAStringSet object of length 3:
## width seq
## [1] 21 CACTAAGCACACAGAGAATAA
## [2] 31 TTAAAAGCTTAAAGCAAACAGAAAGACATGC
## [3] 41 GCGTGACCCTCCTTTGGAAAAGGGTTCCTGCACGTGTGATT
The result is a DNAStringSet object
containing sequences
for each genomic interval.
We can export the extracted sequences in FASTA format using
Biostrings::writeXStringSet()
.
# Assign sequence names
names(sequences) <- c("region1", "region2", "region3")
# Save sequences to a FASTA file
writeXStringSet(sequences, "extracted_sequences.fasta")
The cell is a highly complex system, where essential biological processes occur, including:
Gene expression involves a sequence of molecular transformations: DNA
→ RNA → Protein
In 1956, Francis Crick described this directional flow of genetic information as the “central dogma” of molecular biology.
This concept forms the foundation of molecular genetics, explaining how genetic instructions in DNA are transcribed into RNA and translated into proteins.
Once a precursor messenger RNA (pre-mRNA) is synthesized, it undergoes several modifications before being transported out of the nucleus for translation.
A key step in mRNA maturation is splicing, a process
where introns (non-coding sequences) are removed
whereas exons (coding sequences) are joined
together.
The final product is a mature messenger RNA (mRNA),
which can be translated into a functional protein.
More than 95% of human genes are transcribed into multiple isoforms (mature transcripts) that encode for different proteins, thus expanding protein diversity. This mechanism is called alternative splicing.
Key takeaways:
Exons are the building blocks of transcripts: They contain the coding sequences that determine protein structure.
95% of genes produce multiple transcripts: Through alternative splicing, a single gene can generate different mRNA isoforms, leading to greater protein diversity.
Genomic features, such as genes, transcripts, and
exons, are stored in GTF
(Gene Transfer Format) or
GFF
(General Feature Format) files.
These files allow us to reconstruct gene models, which represent the
structure and organization of genes
within the genome.
Gene models can be accessed using the
TxDb (Transcript Database)
packages in Bioconductor.
The standard format for TxDb packages
follows this
structure:
TxDb.<species>.<source>.<major version>.<table>
BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
If you have a GTF
file stored locally,
you can import it and convert it into a TxDb object
using
the makeTxDbFromGFF()
function from the
GenomicFeatures
package:
library(GenomicFeatures)
txdb <- makeTxDbFromGFF(file = "GENCODE_v44.gtf")
This allows you to analyze custom gene annotations or work with specific genome versions.
A TxDb object
stores structured genomic annotations,
including gene, transcript, exon, and CDS (coding sequence)
information.
Typing the name of a TxDb object
provides an overview of
the gene model it contains:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TxDb.Hsapiens.UCSC.hg19.knownGene
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
This displays key metadata such as:
The genome version (e.g., hg19)
The data source (e.g., UCSC, Ensembl)
The number of chromosomes, genes, transcripts, and exons
TxDb object
A TxDb object
allows us to extract various genomic
features such as genes, transcripts, exons, coding sequences (CDS), and
promoters.
The output of these queries is typically a
GRanges Object
, which makes it easy to manipulate genomic
intervals.
Feature | Function Call |
---|---|
Genes | genes(txdb) |
Transcripts | transcripts(txdb) |
Exons | exons(txdb) |
CDS (Coding Sequences) | cds(txdb) |
Promoters | promoters(txdb, upstream, downstream) |
We can retrieve transcript IDs and corresponding gene IDs by specifying additional columns in the query:
transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene, columns = c("gene_id","tx_id"))
## GRanges object with 82960 ranges and 2 metadata columns:
## seqnames ranges strand | gene_id tx_id
## <Rle> <IRanges> <Rle> | <CharacterList> <integer>
## [1] chr1 11874-14409 + | 100287102 1
## [2] chr1 11874-14409 + | 100287102 2
## [3] chr1 11874-14409 + | 100287102 3
## [4] chr1 69091-70008 + | 79501 4
## [5] chr1 321084-321115 + | 5
## ... ... ... ... . ... ...
## [82956] chrUn_gl000237 1-2686 - | 82956
## [82957] chrUn_gl000241 20433-36875 - | 82957
## [82958] chrUn_gl000243 11501-11530 + | 82958
## [82959] chrUn_gl000243 13608-13637 + | 82959
## [82960] chrUn_gl000247 5787-5816 - | 82960
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
The promoters()
function identifies regions surrounding
the transcription start site (TSS). It requires specifying: -
Upstream
(e.g., 1000 bases) - Downstream
(e.g., 100 bases)
promoters(TxDb.Hsapiens.UCSC.hg19.knownGene
, upstream = 1000
, downstream = 100)
## GRanges object with 82960 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## uc001aaa.3 chr1 10874-11973 + | 1 uc001aaa.3
## uc010nxq.1 chr1 10874-11973 + | 2 uc010nxq.1
## uc010nxr.1 chr1 10874-11973 + | 3 uc010nxr.1
## uc001aal.1 chr1 68091-69190 + | 4 uc001aal.1
## uc001aaq.2 chr1 320084-321183 + | 5 uc001aaq.2
## ... ... ... ... . ... ...
## uc011mgu.1 chrUn_gl000237 2587-3686 - | 82956 uc011mgu.1
## uc011mgv.2 chrUn_gl000241 36776-37875 - | 82957 uc011mgv.2
## uc011mgw.1 chrUn_gl000243 10501-11600 + | 82958 uc011mgw.1
## uc022brq.1 chrUn_gl000243 12608-13707 + | 82959 uc022brq.1
## uc022brr.1 chrUn_gl000247 5717-6816 - | 82960 uc022brr.1
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
The transcriptLengths()
function computes the sum of
exon lengths in each transcript (not including
introns).
head(transcriptLengths(TxDb.Hsapiens.UCSC.hg19.knownGene))
## tx_id tx_name gene_id nexon tx_len
## 1 1 uc001aaa.3 100287102 3 1652
## 2 2 uc010nxq.1 100287102 3 1488
## 3 3 uc010nxr.1 100287102 3 1595
## 4 4 uc001aal.1 79501 1 918
## 5 5 uc001aaq.2 <NA> 1 32
## 6 6 uc001aar.2 <NA> 1 62
Note: This function measures the processed transcript length (sum of exonic regions), not the full genomic region transcribed into RNA. To get the total transcribed DNA length, use:
head(width(transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)))
## [1] 2536 2536 2536 918 32 62
The select()
function allows us to retrieve specific
annotations from a TxDb object
, such as transcript names,
gene IDs, or exon details.
In particular, we can:
Use columns()
to list available information (e.g.,
transcript names, gene lengths)
Use keytypes()
to see which column types can be used
as keys (e.g., GENEID, TXID)
select()
to extract data
For example:
# Step 1: Define gene IDs of interest
keys <- c("100033416", "100033417", "100033420")
# Step 2: Check available columns and key types
columns(TxDb.Hsapiens.UCSC.hg19.knownGene) # Lists all available metadata fields
## [1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME" "CDSSTART"
## [6] "CDSSTRAND" "EXONCHROM" "EXONEND" "EXONID" "EXONNAME"
## [11] "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID" "TXCHROM"
## [16] "TXEND" "TXID" "TXNAME" "TXSTART" "TXSTRAND"
## [21] "TXTYPE"
keytypes(TxDb.Hsapiens.UCSC.hg19.knownGene) # Lists valid key types
## [1] "CDSID" "CDSNAME" "EXONID" "EXONNAME" "GENEID" "TXID" "TXNAME"
# Step 3: Retrieve transcript names for selected gene IDs
select(TxDb.Hsapiens.UCSC.hg19.knownGene, keys = keys, columns = "TXNAME", keytype = "GENEID")
## GENEID TXNAME
## 1 100033416 uc001yxl.4
## 2 100033417 uc001yxo.3
## 3 100033420 uc001yxr.3
We can extract transcript sequences by combining a
TxDb object
(which contains gene and transcripts
annotations) with a BSgenome object
(which stores genome
sequences).
The function extractTranscriptSeqs()
allows us to
extract full transcript sequences using a reference genome.
extractTranscriptSeqs(x = BSgenome.Hsapiens.UCSC.hg19, transcripts = TxDb.Hsapiens.UCSC.hg19.knownGene)
## DNAStringSet object of length 82960:
## width seq names
## [1] 1652 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 1
## [2] 1488 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 2
## [3] 1595 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 3
## [4] 918 ATGGTGACTGAATTCATTTTT...ATTCTAGTGTAAAGTTTTAG 4
## [5] 32 TACAGACCAAGCTCATGACTCACAATGGCCTA 5
## ... ... ...
## [82956] 1217 GCCAGTTTAGGGTCTCTGGTA...AACCTCCGCCTCCTGAGATC 82956
## [82957] 737 CTCCACTTCTGATCCTCCCCG...CCGCTTTATTAGATGCAGTG 82957
## [82958] 30 TGGTGAATTTCAGCCAAAGTGGCCAAAGAA 82958
## [82959] 30 TTGCAGAGGTGGCTGGTTGCTCTTTGAGCC 82959
## [82960] 30 TGGTGAATTTCAGCCAAAGTGGCCAAAGAA 82960
Key applications:
Analyze transcript sequences from genome data
Compare transcript isoforms
Extract sequences for further downstream analysis (e.g., translation, motif search)
GRangesList
If we need to group genomic features, we can use the
following functions, which return a GRangesList
:
transcriptsBy()
: Groups transcripts by gene or other
featuresexonsBy()
: Groups exons by transcript or genecdsBy()
: Groups CDS (coding sequences) by transcript or
gene
These will return a GRangesList
where each
element contains the exons of a specific transcript.
exonsBy(x = TxDb.Hsapiens.UCSC.hg19.knownGene, by = "tx")[1:2]
## GRangesList object of length 2:
## $`1`
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr1 11874-12227 + | 1 <NA> 1
## [2] chr1 12613-12721 + | 3 <NA> 2
## [3] chr1 13221-14409 + | 5 <NA> 3
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
##
## $`2`
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr1 11874-12227 + | 1 <NA> 1
## [2] chr1 12595-12721 + | 2 <NA> 2
## [3] chr1 13403-14409 + | 6 <NA> 3
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
We can change the by
parameter to group differently:
By transcript = tx
By gene = gene
TxDb Object
as a GTF/GFF fileIf we need to export a TxDb object
(gene annotation
data) into a GTF
or GFF
file, we can use the
export.gff()
function from the rtracklayer
package.
export.gff(TxDb.Hsapiens.UCSC.hg19.knownGene, con = "file.gtf", format ="gtf")
export.gff(TxDb.Hsapiens.UCSC.hg19.knownGene, con = "file.gff", format ="gff")
GTF
(Gene Transfer Format) is a widely used annotation
format in gene expression analysis.
GFF
(General Feature Format) is a more general format used
in genomic annotations.