2025-03-05

Theoretical intro: CpG islands

The CpG sites (i.e. 5’C—phosphate—G3’) are regions of DNA where a cytosine is followed by a guanine nucleotide in the linear sequence of bases along its 5’ → 3’ direction.
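As a small illustration of the definition above, CpG sites in a sequence can be counted by looking for a C immediately followed by a G. This is only a sketch on a made-up toy sequence; the variable names `dna`, `cpg_pos` and `n_cpg` are hypothetical.

```r
# Hypothetical toy sequence, written in the 5' -> 3' direction
dna <- "ACGTCGCGGATCCGTA"

# Start positions of every "CG" dinucleotide (fixed = TRUE: literal match)
cpg_pos <- gregexpr("CG", dna, fixed = TRUE)[[1]]

# gregexpr() returns -1 when nothing matches, so count only real positions
n_cpg <- sum(cpg_pos > 0)

cpg_pos
n_cpg
```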

Often, the cytosines in CpG dinucleotides are methylated (5-methylcytosines). In mammals, 70% to 80% of CpG cytosines are methylated. Methylating the cytosine within a gene can influence its expression.

CpG islands (also called CG-Rich Islands) are regions with a high frequency of CpG sites. In humans, about 70% of promoters located near the transcription start site of a gene contain a CpG island.

Input/Output

Read the CpG island (CpGi) data contained in CpGi.table.hg18.csv (you can find it in the Datasets folder). This file was downloaded from the compGenomRData package and is a comma-separated file.

Store it in a variable called cpgi.
Explore the cpgi dataframe by applying the str() function.

Evaluate the dimensions of the dataframe.

Visualize the first rows of the dataset using the head() function.

What happens if you set stringsAsFactors=TRUE in the reading function? Use the str() function again.

Read only the first 10 rows of the CpGi table.

Read the file skipping the first 10 lines of the CpGi table.

Try reading the file setting header=FALSE. What happens?

Read the file again using the optimal options to import the data correctly.
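The reading options used in the steps above can be tried on a small stand-in file first. The sketch below writes a toy CSV to a temporary file; its columns and values are hypothetical and are not taken from CpGi.table.hg18.csv, so the real file may need different settings.

```r
# Toy stand-in for the real table (hypothetical values)
toy <- data.frame(
  chrom      = c("chr1", "chr1", "chr2", "chr2"),
  chromStart = c(18598L, 124987L, 35581L, 209000L),
  chromEnd   = c(19673L, 125426L, 36700L, 209500L),
  name       = c("CpG: 116", "CpG: 30", "CpG: 45", "CpG: 52"),
  length     = c(1075L, 439L, 1119L, 500L)
)
csv_file <- tempfile(fileext = ".csv")
write.csv(toy, csv_file, row.names = FALSE)

# Standard import: first line is the header, text stays character
cpgi <- read.csv(csv_file, header = TRUE, stringsAsFactors = FALSE)
str(cpgi)
dim(cpgi)
head(cpgi)

# Text columns are converted to factors
cpgi_fac <- read.csv(csv_file, stringsAsFactors = TRUE)
str(cpgi_fac)

# Only the first 2 data rows
first2 <- read.csv(csv_file, nrows = 2)

# skip counts lines of the file, header included: skip = 3 drops the
# header plus 2 data rows, so header = FALSE is needed here
skipped <- read.csv(csv_file, skip = 3, header = FALSE)

# With header = FALSE the column names become the first data row
no_header <- read.csv(csv_file, header = FALSE)
head(no_header)
```

Note that `skip` and `header` interact: skipping lines while keeping `header = TRUE` would promote a data row to column names.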

Write the CpG islands to an RDS file. Set the output folder as you prefer. To write the file to your home folder you can use file="~/filename.rds", since on Linux ~/ denotes the home folder. On Windows, R also accepts forward slashes (/) in paths; if you use backslashes instead, they must be doubled (\\) inside R strings.

Save the CpG islands in a txt file. Make sure to use the quote=FALSE, sep="\t" and row.names=FALSE arguments. What do these arguments do?
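A minimal sketch of the two export steps, assuming a small toy data frame in place of the real cpgi table and writing to R's temporary directory (the file names `cpgi.rds` and `cpgi.txt` are hypothetical):

```r
# Toy stand-in for cpgi (hypothetical values)
cpgi <- data.frame(
  chrom  = c("chr1", "chr2"),
  name   = c("CpG: 116", "CpG: 45"),
  length = c(1075L, 1119L)
)

# RDS stores a single R object, including its column types
rds_file <- file.path(tempdir(), "cpgi.rds")
saveRDS(cpgi, file = rds_file)

# Plain-text export: no quotes around strings, tab as separator,
# and no extra column of row names
txt_file <- file.path(tempdir(), "cpgi.txt")
write.table(cpgi, file = txt_file, quote = FALSE, sep = "\t", row.names = FALSE)

# Inspect the raw lines to see the effect of the three arguments
readLines(txt_file)
```

Re-running `write.table()` with one argument changed at a time and checking `readLines()` again is a quick way to see what each argument does.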

Read the RDS file you created. Extract the CpG islands located on chr1 only and assign them to a variable called chr1. HINT: subset cpgi using both [] (creating a logical vector with the == operator) and the subset() function.
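The two subsetting styles mentioned in the hint can be compared on a toy table. This sketch assumes the chromosome column is called `chrom`, which may differ in the real file; the values are made up.

```r
# Toy stand-in for the table read back with readRDS() (hypothetical values)
cpgi <- data.frame(
  chrom  = c("chr1", "chr2", "chr1", "chr3"),
  name   = c("CpG: 116", "CpG: 45", "CpG: 30", "CpG: 52"),
  length = c(1075L, 1119L, 439L, 500L)
)

# 1) Logical vector built with ==, then used as a row index
is_chr1 <- cpgi$chrom == "chr1"
chr1 <- cpgi[is_chr1, ]

# 2) subset() evaluates the condition inside the data frame
chr1_alt <- subset(cpgi, chrom == "chr1")

# Both approaches should select the same rows
all.equal(chr1, chr1_alt)
```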

Create a variable called chr2 with the CpG islands located on chr2 only. Save both chr1 and chr2 in an RData file. Then, remove both the chr1 and chr2 variables from the R environment using rm(). What happens if you try to visualize chr1 now? Inspect the “Environment” tab.

Load the RData file you created and inspect its content. What happens if you try to visualize the first rows of chr1?
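The save, remove and load cycle can be sketched as follows, using toy data frames and a hypothetical file name in R's temporary directory:

```r
# Toy stand-ins for chr1 and chr2 (hypothetical values)
chr1 <- data.frame(chrom = "chr1", length = c(1075L, 439L))
chr2 <- data.frame(chrom = "chr2", length = c(1119L, 500L))

# An RData file can hold several named objects at once
rdata_file <- file.path(tempdir(), "chr1_chr2.RData")
save(chr1, chr2, file = rdata_file)

# After rm() the objects no longer exist in the current environment
rm(chr1, chr2)
exists("chr1", inherits = FALSE)

# load() restores the objects under their original names and
# invisibly returns those names
loaded <- load(rdata_file)
loaded
head(chr1)
```

Unlike `readRDS()`, `load()` does not need an assignment: the objects reappear with the names they had when they were saved.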

In your environment you have “chr1” and “chr2” again. Let’s work on them!
- Create a vector length_chr1 that contains values in column length in chr1
- Create a vector length_chr2 that contains values in column length in chr2
- Evaluate the quantile distribution of length_chr1 and length_chr2 in steps of 0.1 (HINT: use the quantile() function).
- Evaluate mean, median and standard deviation of length_chr1 and length_chr2. Are there differences between the two? Comment on this, also considering quantiles.
- Create three different dataframes from chr1: chr1_small, chr1_medium and chr1_large, using quantile values (<=30%, >30% and <=60%, and >60%)
- Create three different dataframes from chr2: chr2_small, chr2_medium and chr2_large, using quantile values (<=30%, >30% and <=60%, and >60%)
- How many rows do the dataframes you created have?
- Evaluate how many times each name is repeated in chr1_small
- Evaluate how many times each name is repeated in chr2_large
- Considering the whole cpgi dataframe, how many CpG islands do you have per chromosome?
- Create a vector casual of normally distributed numbers whose length equals the number of unique values of chr1_small$name
- Name the elements of casual with the unique values of chr1_small$name
- Order casual from the largest to the smallest value
- Add a column Gaussian in chr1_small by matching the column name with the names of casual.
- Add a column Gaussian in chr2_small by matching the column name with the names of casual. Are there missing values? If yes, how many? How many unique names do they correspond to?
- Transform chr2_small by excluding the rows that have missing values in Gaussian
- Save chr2_small and chr1_small in an RData file
- Create a list dat that contains:
  - The subset of chr1_large for which length is greater than 25000
  - The number of rows corresponding to values of length in chr2_large that are greater than 25000
  - A matrix you create by selecting only the last three columns from chr1_small
- Save the list you created into an RDS file
- What is the type of the vector contained in the list you created?
- Check whether c("CpG: 45","CpG: 52", "CpG: 108") are present in the name column of chr2_medium
- Extract the row indices of the elements in the name column of chr2_medium that match c("CpG: 45","CpG: 52", "CpG: 108")
- Extract from the matrix in the dat list all columns except the third, keeping only the rows with perGc values between 53 and 62
- Extract the indices of the rows of chr2_large whose name contains 2
- Create a new logical vector lo evaluating whether the values in the name column of chr1_large contain the number 3
- Transform lo into a matrix with 7 columns. You will obtain a warning… why?
- Add the matrix you created to the list dat and overwrite the file you saved before with the new list
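Several of the steps in the list above follow the same few patterns: quantile-based splitting, named-vector matching, pattern search and reshaping. The sketch below runs them on simulated data, so all values are hypothetical. It uses a single toy chr1 table, takes chr1_large as a stand-in for the chr2 tables, and uses a threshold of 2500 instead of 25000 because the toy lengths are small.

```r
set.seed(1)  # reproducible toy data (hypothetical values)
chr1 <- data.frame(
  name   = paste("CpG:", sample(20:40, 50, replace = TRUE)),
  length = sample(200:3000, 50),
  perGc  = round(runif(50, 50, 70), 1)
)

# Quantiles in steps of 0.1, plus summary statistics
length_chr1 <- chr1$length
q <- quantile(length_chr1, probs = seq(0, 1, by = 0.1))
c(mean = mean(length_chr1), median = median(length_chr1), sd = sd(length_chr1))

# Split at the 30% and 60% quantiles
q30 <- q[["30%"]]
q60 <- q[["60%"]]
chr1_small  <- chr1[chr1$length <= q30, ]
chr1_medium <- chr1[chr1$length > q30 & chr1$length <= q60, ]
chr1_large  <- chr1[chr1$length > q60, ]
sapply(list(chr1_small, chr1_medium, chr1_large), nrow)

# How many times each name occurs
table(chr1_small$name)

# Named random vector, sorted from largest to smallest
uniq <- unique(chr1_small$name)
casual <- rnorm(length(uniq))
names(casual) <- uniq
casual <- sort(casual, decreasing = TRUE)

# match() gives, for each row, the position of its name in names(casual)
chr1_small$Gaussian <- unname(casual[match(chr1_small$name, names(casual))])

# Names absent from casual produce NA (stand-in for the chr2_small step)
other <- chr1_large
other$Gaussian <- unname(casual[match(other$name, names(casual))])
n_missing <- sum(is.na(other$Gaussian))
other <- other[!is.na(other$Gaussian), ]

# List holding a data frame, a single number and a matrix
last3 <- (ncol(chr1_small) - 2):ncol(chr1_small)
dat <- list(
  long   = chr1_large[chr1_large$length > 2500, ],
  n_long = sum(chr1_large$length > 2500),
  last3  = as.matrix(chr1_small[, last3])
)
typeof(dat$n_long)

# Rows with perGc between 53 and 62, all columns except the third
keep <- dat$last3[, "perGc"] >= 53 & dat$last3[, "perGc"] <= 62
dat$last3[keep, -3, drop = FALSE]

# Pattern search: logical vector and row indices
lo <- grepl("3", chr1_large$name)
which(lo)

# 20 values do not fill a 7-column matrix evenly, so R recycles and warns
m <- matrix(lo, ncol = 7)
dat$lo_matrix <- m
saveRDS(dat, file.path(tempdir(), "dat.rds"))
```

Membership checks such as the one on c("CpG: 45","CpG: 52", "CpG: 108") follow the same idea: `%in%` returns a logical vector, and `which()` turns it into row indices.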