2025-03-07

Exercises on basic structures

Make a dataframe called exam_rates from the following vectors:

name=c("Marie", "Gianni", "Silvia", "Laura","Mariachiara", "Simone", "Sarah", "Francesca", "Matteo")

id=c(8452, "AE12", 6732, "AS54", "GF49", 9328, "DS34", 3476, 7628)

pointsTest1=c(12, 15, 14, 5, 18, 3, 10, 7, 16)
  
pointsTest2=c(10, 9, 10, 7, 10, 3, 9, 8, 1)

pointsTest3=c(1, 2, 2, 2, 2, 1, 1, 0, 0)

Create a new column score containing the sum of the pointsTest1, pointsTest2 and pointsTest3 columns, and a column average containing the mean of these three columns. Hint: see what rowSums(), colSums(), rowMeans() and colMeans() do.

Create a new logical column passed containing TRUE if the score is >17 and FALSE otherwise. Hint: remember the theoretical part on logical vectors.
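
As a minimal sketch of the functions named in the two hints above, here is how they behave on a small made-up data frame. The object toy and its columns are illustrative assumptions and are not part of the exercise data.

```r
# Hypothetical toy data, unrelated to exam_rates
toy <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Row-wise sum and mean over the selected columns
toy$total <- rowSums(toy[, c("a", "b")])
toy$avg   <- rowMeans(toy[, c("a", "b")])

# A comparison returns one TRUE/FALSE per row, which can be stored as a column
toy$big <- toy$total > 6

print(toy)
```

colSums() and colMeans() work the same way but collapse each column instead of each row.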

Evaluate how many people have each score. How many people obtained the same score? Hint: use the table() function.
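
A short sketch of what table() returns, using a made-up numeric vector (x and counts are illustrative names, not the exam data):

```r
# Hypothetical scores
x <- c(3, 5, 3, 7, 5, 3)

# One count per distinct value; the names of the result are the values
counts <- table(x)
print(counts)

# Values that occur more than once
shared <- names(counts)[as.vector(counts) > 1]
print(shared)
```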

Create a new dataframe exam_rates2 by selecting from the previous dataframe only the people who passed the exam and the columns name, pointsTest1, pointsTest2 and pointsTest3.
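
Selecting rows with a logical column and columns by name can be done in one step inside [ ]. A minimal sketch on made-up data (toy, ok and kept are illustrative names):

```r
# Hypothetical toy data
toy <- data.frame(id = c("p", "q", "r"),
                  x  = c(1, 8, 9),
                  ok = c(FALSE, TRUE, TRUE))

# Rows where ok is TRUE, and only the columns id and x
kept <- toy[toy$ok, c("id", "x")]
print(kept)
```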

Melt the dataframe and reassign it to exam_rates2. Inspect the structure of the dataframe. What happened?
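
The exercise does not say which package provides melt(); the sketch below assumes the reshape2 package (data.table offers a melt() as well) and only runs if reshape2 is installed. The data frame wide is made up for illustration.

```r
# Hypothetical wide data: one row per person, one column per test
wide <- data.frame(who = c("p", "q"), t1 = c(1, 2), t2 = c(3, 4))

# Assumption: melt() comes from reshape2
if (requireNamespace("reshape2", quietly = TRUE)) {
  # id.vars stays as a column; the other columns are stacked
  # into the pair of columns variable / value
  long <- reshape2::melt(wide, id.vars = "who")
  str(long)
}
```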

Create a vector vec by extracting the value column from the dataset exam_rates2. Then evaluate the sum, mean, standard deviation, summary and quantiles of the vector. Hint: use the sum(), mean(), sd(), summary() and quantile() functions. Take some time to inspect the results.
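
For orientation, this is what the functions in the hint return on a small made-up vector (x is illustrative). Note that quantile() returns a named vector, so single quantiles can be picked by name.

```r
# Hypothetical values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

s  <- sum(x)
m  <- mean(x)
sdev <- sd(x)          # sample standard deviation (divides by n - 1)
q  <- quantile(x)      # named vector: 0%, 25%, 50%, 75%, 100%

print(summary(x))
print(q)
q[["50%"]]             # the median, extracted by name
```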

Create a list collection that contains exam_rates, exam_rates2, vec and the quantiles of vec.

Give a name to each element of the list.
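
A minimal sketch of the two usual ways to name list elements, on a made-up list (parts and alt are illustrative names):

```r
# Hypothetical list created without names
parts <- list(1:3, "text", c(TRUE, FALSE))

# Names can be assigned afterwards...
names(parts) <- c("numbers", "label", "flags")

# ...or given directly when the list is created
alt <- list(numbers = 1:3, label = "text")

# Named elements can then be reached with $ or [[ ]]
parts$label
alt[["numbers"]]
```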

Create a matrix student_marks by binding the following vectors as rows.

Marie=c(28, 29, 30, 27, 30)
Gianni=c(18, 19, 21, 26, 28)
Silvia=c(23, 24, 26, 30, 22)
Laura=c(30,29, 30, 28, 29)
Mariachiara=c(29, 30, 28, 27, 25)
Simone=c(24, 29, 30, 22, 19)
Sarah=c(26, 28, 25, 29, 30)
Francesca=c(18, 26, 28, 19, 26)
Matteo=c(26, 28, 30, 30, 28)

Give column names to the matrix: CourseA, CourseB, CourseC, CourseD, CourseE

Add the matrix to the collection list and name it. Hint: remember how to concatenate lists; a list can also contain a single element.
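
The hint can be sketched on made-up objects as follows (r1, r2, m and box are illustrative names). Wrapping the matrix in list() before concatenating matters: without it, c() would splice the individual matrix values into the list.

```r
# Hypothetical row vectors
r1 <- c(1, 2)
r2 <- c(3, 4)

# rbind() stacks vectors as rows; row names come from the object names
m <- rbind(r1, r2)
colnames(m) <- c("A", "B")

# An existing list
box <- list(first = "x")

# Concatenate with a one-element named list holding the matrix
box <- c(box, list(grid = m))
str(box)
```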

You have the following vector with ages of the students:

age=c(21, 24, 22, NA, NA, 20, 23, NA, NA)
names(age)=c("Marie", "Gianni", "Silvia", "Laura", "Mariachiara", "Simone", "Sarah", "Francesca", "Matteo")

- Extract the names of the people whose age is missing
- Extract the names of the people whose age is available
- Replace the missing values in the vector with the mean of the available values. Hint: to do this you can select only the non-NA values or explore the parameters of the mean() function. Try both ways!
- Using the grep() or grepl() function, extract the ages of the people with “ia” in their names
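
The building blocks for these tasks can be sketched on a small made-up named vector (h and its names are illustrative, not the age data):

```r
# Hypothetical named vector with missing values
h <- c(anna = 10, bob = NA, carla = 14, dan = NA)

# is.na() gives a logical vector that can index the names
names(h)[is.na(h)]
names(h)[!is.na(h)]

# Two equivalent ways to average the available values
m1 <- mean(h[!is.na(h)])
m2 <- mean(h, na.rm = TRUE)

# Replace the missing entries
h[is.na(h)] <- m2

# grepl() returns TRUE/FALSE per name; grep() would return the positions
sel <- h[grepl("an", names(h))]
print(sel)
```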

Consider the following dataframe:

info=cbind.data.frame(name=c("Marie", "Gianni", "Silvia", "Laura", "Mariachiara", "Simone", "Sarah", "Francesca", "Matteo"),
                       goes_to_gym=c(TRUE, TRUE, FALSE, NA, FALSE, NA, FALSE, TRUE, TRUE),
                       times_gym= c(3, NA,0, NA,0,NA, 0, NA, 5 ),
                       likes=c("ski", "snowboard", "pilates", "football", "tennis", "basketball", "tennis", NA, NA))

- Extract the people who have missing values in goes_to_gym
- Extract the people who have missing values in times_gym
- Extract the people who have missing values in likes
- Extract the people for whom all the information is available. Do this both with the subset() function and with conditions inside [ ]
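
A sketch of the same ideas on a made-up data frame (d and its columns are illustrative). complete.cases() is shown as one possible shortcut for the condition inside [ ]; it is an addition here, not something the exercise asks for.

```r
# Hypothetical data frame with missing values in two columns
d <- data.frame(id = 1:4,
                a  = c(1, NA, 3, 4),
                b  = c("x", "y", NA, "z"))

# Rows with a missing value in one column
d$id[is.na(d$a)]

# Complete rows with subset(): column names are used directly
full1 <- subset(d, !is.na(a) & !is.na(b))

# Complete rows with a condition inside [ ]
full2 <- d[!is.na(d$a) & !is.na(d$b), ]

# complete.cases() checks all columns at once
full3 <- d[complete.cases(d), ]
```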

Make a list info that contains the following elements (l, hour, station and mt). Be sure to assign names to the elements.

l=cbind.data.frame(Date=c("03-04", "03-05", "03-06", "03-07", "03-08", "03-09", "03-10", "03-11", "03-12"),
                   Temp=c(12,5,18, 20, 12, 15, 17, 15, 19))


hour=c(12, 4, 14, 11, 13, 16, 12.30, 20, 13.30 )


station=c(rep("Saluzzo(CN)", 2), rep("Montecatini(PT)", 3),"Pescia(PT)", rep("Milano(MI)", 3))


mt=rbind(c(13,12,35), c(2.6, 6.2, 9), seq(30, 50, 6.7))

Without extracting elements from the list, modify them with the following steps (you may not use the original objects, only those in the list):

- Assign names to the hour vector using the values in the Temp column of the dataframe l
- Transform station into a factor whose levels follow the alphabetical order of the cities
- Add a column to l that contains the values in station
- Assign to hour an attribute station containing the character values of station. Notice that I said character, not factor
- Order the station vector
- Add to the list another vector New containing the Temp values extracted by matching station with the new column you added to the data frame (keep only the first element)
- Check whether the Temp column contains values in the range 19:40
- Extract the indexes of the elements in hour that are equal to 15
- Create a new logical vector by evaluating which cities in station are not in the “PT” province. Hint: explore the parameters of the function you will use. Then add this vector to the list
- How many elements does your list contain?
- Create a new matrix mt2 with the same dimensions as mt that contains the mt values if they are less than 10 and the mt values + 1 if they are greater than or equal to 10. Add this to the list
- Using only a combination of [[ ]] and [ ], extract the second element of the fifth element of the list
- Extract the second column of the mt matrix from the list
- Extract all elements of the list except the second
- Extract the data frame contained in the list and assign it to a variable data_frame
- Subset data_frame by keeping only the data for the stations “Milano(MI)” and “Saluzzo(CN)” and reassign it to the variable data_frame
- Concatenate data_frame with the following dataframe data2 and reassign the result to data_frame

data2=cbind.data.frame(Date=c("03-07", "03-08", "03-09", "03-10", "03-11", "03-12", "03-04", "03-05", "03-06", "03-09", "03-10", "03-11", "03-12"),
                   Temp=c(11, 12, 14, 9, 8, 13, 14, 15, 6, 10, 18,14, 13),
                   station=c(rep("Milano(MI)", 6), rep("Saluzzo(CN)",7)))
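
Several of the steps above rely on modifying an object while it stays inside a list. The sketch below shows the general pattern on a made-up list (lst and all its elements are illustrative assumptions, not the info list): replacement functions such as names<-, attr<- and $<- can be applied directly to lst$element.

```r
# Hypothetical list
lst <- list(v = c(10, 20, 30),
            f = c("b(Y)", "a(X)", "b(Y)"),
            m = matrix(1:6, nrow = 2))

# Modify elements in place, without copying them out of the list
names(lst$v) <- c("x", "y", "z")
lst$f <- factor(lst$f, levels = sort(unique(lst$f)))
attr(lst$v, "origin") <- as.character(lst$f)   # character, not factor

# Range check and index lookup
in_range <- any(lst$v >= 15 & lst$v <= 25)
idx <- which(lst$v == 20)

# fixed = TRUE makes grepl() treat "(X)" literally, not as a regex group
lst$not_x <- !grepl("(X)", as.character(lst$f), fixed = TRUE)

# ifelse() keeps the dimensions of the tested matrix
lst$m2 <- ifelse(lst$m < 4, lst$m, lst$m + 1L)

# [[ ]] picks one element, [ ] then indexes inside it
lst[[1]][2]
lst[["m"]][, 2]

# Negative index drops an element; the result is still a list
rest <- lst[-2]
length(lst)
```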

- Using table(), evaluate how many times each Temp value occurs in each city
- Using dcast, expand the data frame using Date and station as reference columns (reassign to data_frame)
- Assign the values in the column Date as row names
- Remove the column Date from the dataframe
- Transform the dataframe into a matrix
- Identify the names of the rows in which Milano and Saluzzo have the same values
- Transform the matrix by replacing the values equal to 13 with “a”. What happens to the matrix? Evaluate str() and typeof()
- What happens if you try evaluating which elements are >10?
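
Two of the points above can be previewed on made-up data (obs, tab and m are illustrative names): table() accepts two vectors and cross-tabulates them, and a matrix can hold only one type, so inserting a string converts every cell.

```r
# Hypothetical observations
obs <- data.frame(val = c(1, 2, 1, 2),
                  grp = c("g1", "g1", "g2", "g1"))

# Two-way table: rows are values, columns are groups
tab <- table(obs$val, obs$grp)
print(tab)

# A numeric matrix...
m <- matrix(c(1, 13, 5, 13), nrow = 2)
typeof(m)            # "double"

# ...becomes a character matrix after one string is assigned
m[m == 13] <- "a"
typeof(m)            # "character"

# The comparison now works on strings, not on numbers
m > 10
```

After the assignment, 10 is converted to "10" before comparing, so the result follows string ordering rather than numeric ordering.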

Create a dataframe df1 using the vectors samples, genes and expression_counts (binding by columns) and another dataframe df2 using the vectors FOXA1, MYC, AR, ENSG00000281133 and FAM138A (binding by rows). Assign the following colnames to df2: gene_name, gene_type, gene_id.

Then:
- add to df1 the columns gene_id and gene_type, taking the information from df2
- extract the unique names of the protein coding genes from df1
- create a vector cases by grepping from df1 the string “MOD” and keeping only unique values
- create a vector controls by grepping from df1 the string “NORM” and keeping only unique values
- assign name “case” to all the elements in cases vector
- assign name “control” to all the elements in controls vector
- create a new vector samples by concatenating cases and controls
- add a column annotation to df1 in which you have to put “case” and “control” values by matching the column samples in df1 with values in the vector samples

samples=c(rep("MOD01",5), 
          rep("NORM01",5), 
          rep("NORM02", 5), 
          rep("MOD02", 5), 
          rep("MOD03", 5),
          rep("NORM03", 5) )

genes=rep(c("FOXA1","AR","MYC","ENSG00000281133","FAM138A"), 6)

expression_counts=rnorm(30, 25, 4)

FOXA1=c("FOXA1", "protein_coding","ENSG00000129514.8")
AR=c("AR","protein_coding","ENSG00000169083.18")
MYC=c("MYC","protein_coding","ENSG00000136997.21")
ENSG00000281133=c("ENSG00000281133","pseudogene" ,"ENSG00000281133.1")
FAM138A=c("FAM138A", "lncRNA","ENSG00000237613.2")
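
One possible way to transfer information between two tables and to label values by pattern is match() combined with grep(). The sketch below uses made-up data (orders, catalog and the group labels are illustrative assumptions), so it shows the mechanism rather than the solution.

```r
# Hypothetical tables sharing the key column item
orders  <- data.frame(item = c("pen", "cup", "pen", "pad"),
                      qty  = c(2, 1, 5, 3))
catalog <- data.frame(item  = c("cup", "pen", "pad"),
                      price = c(3.5, 1.2, 2.0))

# match() gives, for each order, the row of catalog with the same item
orders$price <- catalog$price[match(orders$item, catalog$item)]

# grep(value = TRUE) returns the matching strings instead of positions
p_items <- unique(grep("^p", orders$item, value = TRUE))
c_items <- unique(grep("^c", orders$item, value = TRUE))

# Give every element of each vector the same name
names(p_items) <- rep("p-group", length(p_items))
names(c_items) <- rep("c-group", length(c_items))
all_items <- c(p_items, c_items)

# Look up the label of each order through the names of the vector
orders$group <- names(all_items)[match(orders$item, all_items)]
print(orders)
```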

Create a function get_info that takes as input a number and a dataframe and returns as output:

- “LOW” if the number is lower than 10
- “HIGH” if the number is greater than 40
- otherwise, the number of values in the expression_counts column of df1 (previous exercise) that are lower than that number

Then:
- create the vector gene_name = c("FOXA1","SRSF1","MYC","PTBP1","AR"). Then, use a for loop to iterate over the elements in the vector. For each element, print “Present” if it is found in the genes column of df1, or “Not Present” otherwise. (HINT: try to use "\n" to write each output in a new line)
- use a for loop to iterate over the elements in the gene_name vector. If an element is not found in the genes column of df1, skip to the next element. Otherwise, select the rows corresponding to that gene and display the average of the expression_counts column.
- create a variable total<-0. Use a for loop to iterate over the elements in the expression_counts column of df1. At each iteration, add the new value to total. The loop should terminate once total exceeds 150. Display the final value of total.
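
The control-flow pieces used in this exercise (if / else if, next and break) can be sketched on made-up inputs. The function classify, its thresholds and the variable running are illustrative assumptions, not the requested get_info.

```r
# Hypothetical function: two early returns, then a count as the default
classify <- function(x, values) {
  if (x < 0) {
    return("NEGATIVE")
  } else if (x > 100) {
    return("LARGE")
  }
  sum(values < x)   # TRUE counts as 1, so this counts the smaller values
}

# next skips the rest of the current iteration
for (w in c("a", "z", "b")) {
  if (!(w %in% c("a", "b"))) next
  cat(w, "found\n")
}

# break leaves the loop as soon as the condition is met
running <- 0
for (v in 1:10) {
  running <- running + v
  if (running > 10) break
}
running
```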

Suppose you have the following list g:
- Extract from the matrix in the list all rows that do not contain missing values
- Sum by column and by row, and evaluate the mean by column and by row, of the original matrix. Explore the options of the functions you will use to exclude missing values and NaN (use the help in RStudio).
- Remove from the matrix in the list column 5 if it contains NA, NaN or infinite values.
- Replace the matrix by removing the rows that contain missing values
- Identify indexes of the vector in the list that contain infinite values

g=list(Value=c(NaN,32, NA,39, Inf, -Inf, 8.9, 4 ),
       Mat=matrix(c(1:9, NA, NaN, 989:103, Inf, NA, 10^7, 9^5, 6*7, 5/3, 6+2, 5-7), ncol=6, byrow = T),
       Df=cbind.data.frame(place=factor(c("Garden", "House", "Square", NA)), N=c(NA, 5, 7, NaN))
       
       )
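
The functions that distinguish NA, NaN and infinite values can be tried on a small made-up matrix and vector first (m and v are illustrative, not the contents of g). Note that Inf is not treated as missing, so it has to be checked separately with is.infinite() or is.finite().

```r
# Hypothetical matrix with one NA, one NaN and one Inf
m <- matrix(c(1, NA,  3,
              4, NaN, 6,
              7, 8,   Inf), ncol = 3, byrow = TRUE)

# TRUE for rows without NA or NaN (Inf does not count as missing)
keep <- complete.cases(m)
m[keep, , drop = FALSE]

# na.rm = TRUE drops both NA and NaN before summing or averaging
rs <- rowSums(m, na.rm = TRUE)
cm <- colMeans(m, na.rm = TRUE)

# is.finite() is FALSE for NA, NaN, Inf and -Inf
bad_col3 <- any(!is.finite(m[, 3]))

# Positions of infinite values in a vector
v <- c(1, Inf, NA, -Inf)
inf_idx <- which(is.infinite(v))
```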