CpG sites (i.e. 5'-C-phosphate-G-3') are regions of DNA where a cytosine is followed by a guanine nucleotide in the linear sequence of bases, read in the 5' → 3' direction.
Often, the cytosines in CpG dinucleotides are methylated (5-methylcytosines). In mammals, 70% to 80% of CpG cytosines are methylated. Methylating the cytosine within a gene can influence its expression.
CpG islands (also called CG-Rich Islands) are regions with a high frequency of CpG sites. In humans, about 70% of promoters located near the transcription start site of a gene contain a CpG island.
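As a small illustration of the definition above, CpG sites in a sequence written 5' → 3' can be counted by searching for the dinucleotide "CG". This is a minimal sketch on a made-up sequence; the name seq_5to3 and the sequence itself are hypothetical and not part of the exercise data.

```r
# Hypothetical toy sequence, written in the 5' -> 3' direction
seq_5to3 <- "ACGTCGCGGATCCG"

# Find every occurrence of the literal dinucleotide "CG"
hits <- gregexpr("CG", seq_5to3, fixed = TRUE)[[1]]

# gregexpr() returns -1 when there is no match at all
n_cpg <- if (hits[1] == -1) 0L else length(hits)

n_cpg  # 4 CpG sites in this toy sequence
```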
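The exercises use the file CpGi.table.hg18.csv (you can find it in the Datasets folder). This file was downloaded from the compGenomRData package and is a comma-separated file.

- Read the file into a variable called cpgi.
- Using the str( ) function, explore the cpgi dataframe.
- Look at the first rows with the head( ) function.
- What happens if you set stringsAsFactors=TRUE in the reading function? Use the str( ) function again.
- Read the file again with header=FALSE. What happens?
- Save cpgi as an RDS file, for example with file="~/filename.rds", since on Linux ~/ denotes the home folder. (Notice that on Windows, paths are usually written with a backslash (\) instead of a slash (/). R accepts forward slashes on Windows too; if you do use backslashes inside an R string, they must be doubled (\\).)
- Write cpgi to a .txt file. Make sure to use the quote=FALSE, sep="\t" and row.names=FALSE arguments. What do these arguments do?
- Create a variable chr1 with CpG islands only on chr1. HINT: subset cpgi using both [] (creating a logical vector with the == operator) and the subset() function.
- Create a variable chr2 with CpG islands only on chr2. Save both chr1 and chr2 in an RData file. Then, remove both the chr1 and chr2 variables from the R environment using rm(). What happens if you try to visualize chr1 now? Inspect the "Environment" tab.

Once the RData file has been loaded again, your environment contains chr1 and chr2 once more. Let's work on them!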
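The file-handling steps in the exercises above can be sketched as follows. This is only an illustration under stated assumptions: it builds a tiny made-up table and writes it to temporary files instead of using CpGi.table.hg18.csv, and the chromosome column name chrom is an assumption (the exercises only name the columns name, length and perGc).

```r
# Hypothetical toy table standing in for the real file; "chrom" is an assumed column name
toy <- data.frame(
  chrom  = c("chr1", "chr1", "chr2", "chr2", "chr3"),
  name   = c("CpG: 45", "CpG: 52", "CpG: 45", "CpG: 108", "CpG: 20"),
  length = c(500L, 1200L, 800L, 2500L, 300L),
  stringsAsFactors = FALSE
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the comma-separated file and explore it
cpgi <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
str(cpgi)
head(cpgi)

# stringsAsFactors = TRUE turns character columns into factors
cpgi_fac <- read.csv(csv_path, stringsAsFactors = TRUE)

# header = FALSE reads the header line as data (one extra row, columns V1, V2, ...)
cpgi_nohead <- read.csv(csv_path, header = FALSE)

# Save and restore a single object as RDS
rds_path <- tempfile(fileext = ".rds")
saveRDS(cpgi, file = rds_path)
cpgi_back <- readRDS(rds_path)

# Tab-separated text: no quotes around strings, no row-name column
txt_path <- tempfile(fileext = ".txt")
write.table(cpgi, file = txt_path, quote = FALSE, sep = "\t", row.names = FALSE)

# Two equivalent ways of subsetting rows
chr1     <- cpgi[cpgi$chrom == "chr1", ]
chr1_alt <- subset(cpgi, chrom == "chr1")
chr2     <- subset(cpgi, chrom == "chr2")

# Save several objects in one RData file, remove them, then load them back
rdata_path <- tempfile(fileext = ".RData")
save(chr1, chr2, file = rdata_path)
rm(chr1, chr2)
load(rdata_path)  # chr1 and chr2 are restored under their original names
```

Note the design difference: saveRDS() stores one object without its name, so the result of readRDS() must be assigned, while save() and load() store and restore objects under their own names.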
- Create a vector length_chr1 that contains the values in column length in chr1.
- Create a vector length_chr2 that contains the values in column length in chr2.
- Compute the quantiles of length_chr1 and length_chr2 (HINT: use the quantile() function).
- Compare length_chr1 and length_chr2. Are there differences between the two? Comment on this, also considering the quantiles.
- Split chr1 into three dataframes: chr1_small, chr1_medium and chr1_large, by using quantile values (<=30%, >30% and <=60%, and >60%).
- Split chr2 into three dataframes: chr2_small, chr2_medium and chr2_large, by using quantile values (<=30%, >30% and <=60%, and >60%).
- Count how many times each name is repeated in chr1_small.
- Count how many times each name is repeated in chr2_large.
- In the cpgi dataframe, how many CpGs do you have per chromosome?
- Create a vector casual with normally distributed numbers whose length is equal to the number of unique values of chr1_small$name.
- Name the elements of casual as the unique values of chr1_small$name.
- Sort casual from the largest to the smallest value.
- Create a column Gaussian in chr1_small by matching the column name with the names of casual.
- Create a column Gaussian in chr2_small by matching the column name with the names of casual. Are there missing values? If yes, how many? How many unique names do they correspond to?
- Filter chr2_small by excluding rows that have missing values in Gaussian.
- Save chr2_small and chr1_small into an RData file.
- Create a list dat that contains:
  - the rows of chr1_large for which length is greater than 25000
  - the values of length in chr2_large that are greater than 25000
  - chr1_small
- Check whether c("CpG: 45","CpG: 52", "CpG: 108") are present in the name column of chr2_medium.
- Find the values of the name column in chr2_medium that correspond to c("CpG: 45","CpG: 52", "CpG: 108").
- From the dat list, select all columns except the third and the rows that contain perGc values between 53 and 62.
- Find the values of name in chr2_large that contain 2.
- Create a logical vector lo evaluating if the values in the name column of chr1_large contain the number 3.
- Arrange lo in a matrix with 7 columns. You will obtain a warning… why?
- Update dat and overwrite the file you saved before with the new list.
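The quantile split, the name matching and the pattern checks from the exercises above can be sketched on toy data. Everything here is an assumption made for illustration: the data frame chr1 is made up, the split is done on the length column, other_small is a hypothetical stand-in for chr2_small, and chr1_medium and chr1 are used where the exercises ask for chr2_medium and chr1_large, so that the example stays self-contained.

```r
# Hypothetical toy stand-in for chr1 (the exercises derive it from cpgi)
chr1 <- data.frame(
  name   = c("CpG: 45", "CpG: 52", "CpG: 45", "CpG: 108", "CpG: 20",
             "CpG: 33", "CpG: 52", "CpG: 71", "CpG: 20", "CpG: 90"),
  length = c(210, 480, 350, 2600, 150, 900, 640, 1200, 30000, 760),
  stringsAsFactors = FALSE
)

# 30% and 60% cut points of the length column (assumed splitting variable)
length_chr1 <- chr1$length
cuts <- quantile(length_chr1, probs = c(0.3, 0.6))

chr1_small  <- chr1[chr1$length <= cuts[1], ]
chr1_medium <- chr1[chr1$length > cuts[1] & chr1$length <= cuts[2], ]
chr1_large  <- chr1[chr1$length > cuts[2], ]

# How many times each name occurs
name_counts <- table(chr1_small$name)

# Named vector of normally distributed values, one per unique name, sorted decreasingly
set.seed(1)
casual <- rnorm(length(unique(chr1_small$name)))
names(casual) <- unique(chr1_small$name)
casual <- sort(casual, decreasing = TRUE)

# match() gives, for each row, the position of its name in names(casual)
chr1_small$Gaussian <- unname(casual[match(chr1_small$name, names(casual))])

# Names absent from casual produce NA; count them, then drop those rows
other_small <- data.frame(name = c("CpG: 45", "CpG: 999", "CpG: 20"),
                          stringsAsFactors = FALSE)
other_small$Gaussian <- unname(casual[match(other_small$name, names(casual))])
n_missing <- sum(is.na(other_small$Gaussian))
other_small <- other_small[!is.na(other_small$Gaussian), ]

# Membership test and the corresponding rows
wanted <- c("CpG: 45", "CpG: 52", "CpG: 108")
present <- wanted %in% chr1_medium$name
matching_rows <- chr1_medium[chr1_medium$name %in% wanted, ]

# A list can hold objects of different types (element names are hypothetical)
dat <- list(
  large_rows    = chr1_large[chr1_large$length > 25000, ],
  large_lengths = chr1_large$length[chr1_large$length > 25000]
)

# Logical vector: does the name contain the character "3"?
lo <- grepl("3", chr1$name, fixed = TRUE)

# 10 values do not fill 7 columns evenly, so matrix() warns; capture the message
warn_msg <- tryCatch(
  { matrix(lo, ncol = 7); NA_character_ },
  warning = function(w) conditionMessage(w)
)
```

The warning appears because matrix() recycles the data to fill the requested shape and complains when the data length does not fit the number of rows or columns evenly.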