---
title: "ICRAFuseR seminar series"
subtitle: "Data types, data structures and data import"
author: "Aida Bargues Tobella"
date: "21 November 2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Before we start...
- Create an R project called "ICRAFuseR seminars" (In RStudio go to File / New Project...)
- Create a folder called "Data" within the folder where you have your R project
- Go to the ICRAFuseR Slack workspace and get the data files (hotdogs, swimming_pools, vanGeldern_run1) from the beginners channel. Place them in the Data folder
- Download the file "Intro to R 2.Rmd" into the folder where you have your R project
- Open it in RStudio
- Install the following packages: *readr* and *readxl*
## Arithmetic with R
The most common arithmetic operators are:
- Addition: +
- Subtraction: -
- Multiplication: *
- Division: /
- Exponentiation: ^
- Modulo: %% (returns the remainder of the division of one number by another)
## Arithmetic with R | Examples
```{r arithmetic, echo = TRUE}
# An addition
5 + 2
# A subtraction
5 - 2
# A multiplication
5 * 2
```
```{r arithmetic2, echo = TRUE}
# A division
(5 + 2) / 2
# Exponentiation
5 ^ 2
# Modulo
5 %% 2
```
Note how the # symbol is used to add comments to the R code. R ignores everything that follows # on the same line.
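As a small additional sketch (not part of the original examples), the chunk below shows why the division example uses parentheses: R follows the usual order of operations, so the division is evaluated before the addition unless parentheses say otherwise. It also uses the integer division operator `%/%`, which is not listed above but pairs naturally with `%%`. The variable names are chosen only for illustration.

```{r extra-precedence, echo = TRUE}
# Without parentheses, the division is evaluated first
no_parens <- 5 + 2 / 2
no_parens
# With parentheses, the addition is evaluated first
with_parens <- (5 + 2) / 2
with_parens
# Integer division (quotient) and modulo (remainder) of 17 by 5
quotient <- 17 %/% 5
remainder <- 17 %% 5
quotient
remainder
# Quotient and remainder together rebuild the original number
quotient * 5 + remainder
```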
## Variables and variable assignment in R
A variable in R allows you to **store** data (e.g., a value, a vector or other objects).
You can assign a value to a variable using <- as follows:
```{r assign variable, echo = TRUE}
var.1 <- 5
var.2 <- 2
```
You can print out the value of the variables:
```{r print variable, echo = TRUE}
var.1
var.2
```
You can use the variable names to access the data stored in these variables:
```{r use vars, echo = TRUE}
var.1 + var.2
```
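A variable can also store the result of a calculation, and assigning a new value to an existing variable overwrites the old one. The following is a minimal sketch; `var.3` is a name introduced here only for illustration.

```{r extra-reassign, echo = TRUE}
var.1 <- 5
var.2 <- 2
# Store the result of a calculation in a new variable
var.3 <- var.1 * var.2
var.3
# Assigning a new value overwrites the old one
var.1 <- 100
var.1
# var.3 was computed before the change, so it keeps its value
var.3
```

Note that `var.3` is not updated automatically when `var.1` changes; the calculation has to be run again.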
## Data types in R
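R has several basic data types. The four most commonly used are:
- Character: Text or string values
- Numeric: Numbers that can take decimal values
- Integer: Whole numbers, which cannot take decimal or fraction values (written with the suffix L, e.g., 2L)
- Logical (or Boolean): TRUE or FALSE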
```{r data types, echo = TRUE}
char <- "Kenya"
num <- 2.58
int <- 2L
log <- FALSE
```
## Data types in R
You can check the data type of a variable using the **class()** function:
```{r data types2, echo = TRUE}
class(char)
class(num)
class(int)
```
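Values can be converted from one data type to another with the `as.*()` family of base R functions. The chunk below is a sketch of a few common conversions; the variable names and values are made up for illustration.

```{r extra-conversion, echo = TRUE}
# A logical value has class "logical"
is_field_site <- TRUE
class(is_field_site)
# Text that looks like a number is still a character...
rain_txt <- "2.58"
class(rain_txt)
# ...until it is converted with as.numeric()
rain_num <- as.numeric(rain_txt)
class(rain_num)
# as.integer() drops the decimal part (it does not round)
as.integer(2.9)
# Logical values become 1 (TRUE) or 0 (FALSE) when converted to numbers
as.numeric(is_field_site)
```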
## Data structures in R
Five data structures are commonly used in R:
- Vector
- Matrix
- Data frame
- List
- Factor
## Vectors | Create a vector
Vectors are the most common and basic data structure in R. A vector is a collection of elements of the same data type (character, logical, integer or numeric).
There are many ways to create a vector. One of them is to specify its content directly using the combine function **c()**, which combines the elements to form a vector:
```{r create vector, echo = TRUE}
char_vec <- c("Kenya", "Tanzania", "Uganda", "Rwanda")
num_vec <- c(3.5, -2.1, 5.3, 2.6)
int_vec <- c(4L, 5L, 2L, 1L)
log_vec <- c(TRUE, FALSE, FALSE, TRUE)
```
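A few other ways of creating vectors are sketched below using the base R functions `seq()` and `rep()` and the `:` operator. The last example shows what happens when data types are mixed inside **c()**: because a vector holds a single data type, all elements are converted to character. The variable names are illustrative.

```{r extra-create-vector, echo = TRUE}
# A sequence of consecutive integers
one_to_five <- 1:5
one_to_five
# A sequence with a fixed step
by_half <- seq(from = 0, to = 2, by = 0.5)
by_half
# The same value repeated several times
repeated <- rep("Kenya", times = 3)
repeated
# Mixing data types: everything is converted to character
mixed <- c(1.5, "Kenya", TRUE)
mixed
class(mixed)
```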
## Vectors | Data type
You can check the data type using **class()**
```{r vector class, echo = TRUE}
class(char_vec)
class(num_vec)
class(int_vec)
```
You can check the length of the vector using **length()**
```{r vector length, echo = TRUE}
length(num_vec)
```
You can check the structure of the object using **str()**
```{r vector str, echo = TRUE}
str(char_vec)
```
## Vectors | Arithmetic
```{r vector arithmetic, echo = TRUE}
a <- c(2, 4, 8, 16, 32)
b <- c(1, 2, 4, 8, 16)
a + b
a/b
```
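Arithmetic on vectors works element by element: the first element of `a` is combined with the first element of `b`, the second with the second, and so on. The sketch below repeats the two vectors so that it can be run on its own, and adds a few base R functions that summarise a whole vector.

```{r extra-vector-arithmetic, echo = TRUE}
a <- c(2, 4, 8, 16, 32)
b <- c(1, 2, 4, 8, 16)
# Element by element: 2 - 1, 4 - 2, 8 - 4, ...
a - b
# A single number is applied to every element
a * 10
# Functions that summarise a whole vector
sum(a)
mean(b)
max(a)
```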
## Vectors | Selection
You can select one or more elements based on their position within the vector. Indicate the indices of the elements you want to select within square brackets (a negative index leaves that element out):
```{r vector sel 1, echo = TRUE}
a
a[3]
a[c(2, 3, 4)]
```
```{r vector sel 2, echo = TRUE}
a
a[2:4]
a[c(1, 3:5)]
a[-2]
```
You can also select vector elements by comparison:
```{r vector sel 3, echo = TRUE}
a
b
a >= 8
b != 16
b == 4
a[a >= 8]
b[b != 16]
b[b == 4]
```
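Comparisons can be combined with `&` (and) and `|` (or), and the base R function `which()` returns the positions of the elements that meet a condition. The following is a short sketch using the vector `a` from above, repeated here so the chunk runs on its own.

```{r extra-vector-conditions, echo = TRUE}
a <- c(2, 4, 8, 16, 32)
# Both conditions must hold (AND)
a[a >= 4 & a <= 16]
# At least one condition must hold (OR)
a[a < 4 | a > 16]
# Positions of the elements that meet a condition
which(a >= 8)
# How many elements meet a condition (each TRUE counts as 1)
sum(a >= 8)
```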
## Matrices
In R, a matrix is a collection of elements of the same data type (numeric, character, integer or logical) arranged into a fixed number of rows and columns.
There are many ways to create a matrix in R. One of them is to combine two vectors of the same data type and length. You can do this using the **rbind()** or **cbind()** functions, which combine vectors by row and by column, respectively.
```{r vector combine, echo = TRUE}
M1 <- rbind(a,b)
M1
```
```{r vector combine2, echo = TRUE}
M2 <- cbind(a,b)
M2
```
```{r matrix class, echo = TRUE}
class(M1)
class(M2)
```
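Another common way to create a matrix is the base R function `matrix()`, which arranges a single vector into the requested number of rows and columns. The sketch below uses the illustrative names `M3` and `M4`.

```{r extra-matrix-create, echo = TRUE}
# Fill a 2 x 3 matrix with the numbers 1 to 6, column by column (the default)
M3 <- matrix(1:6, nrow = 2, ncol = 3)
M3
# Fill the same matrix row by row instead
M4 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
M4
# Number of rows and columns
dim(M3)
```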
## Matrices | Selection
You can select elements from a matrix indicating the indices of the element(s) within square brackets: [row,col]
```{r matrix select, echo = TRUE}
M2
M2[2,1]
```
```{r matrix select2, echo = TRUE}
M2
M2[2:4, 1:2]
```
```{r matrix select3, echo = TRUE}
M2
M2[,2] #selects all elements of the second column
M2[3,] #selects all elements of the third row
```
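Because `M2` was built with **cbind()** from the vectors `a` and `b`, its columns carry those names, and they can be used for selection instead of numeric indices. The sketch below rebuilds `M2` so that it can be run on its own.

```{r extra-matrix-names, echo = TRUE}
a <- c(2, 4, 8, 16, 32)
b <- c(1, 2, 4, 8, 16)
M2 <- cbind(a, b)
# cbind() used the vector names as column names
colnames(M2)
# Select a whole column by name
M2[, "a"]
# Select rows 2 to 4 of column "b"
M2[2:4, "b"]
# Leave out the first row with a negative index
M2[-1, ]
```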
## Matrices | Calculations
You can add all the elements in the rows or columns of a matrix using the functions **rowSums()** and **colSums()**, respectively. Multiplying a matrix by a number multiplies every element:
```{r matrix calc, echo = TRUE}
colSums(M2)
rowSums(M2)
```
```{r matrix calc2, echo = TRUE}
M2
2*M2
```
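Base R offers similar helpers for averages, `rowMeans()` and `colMeans()`, and the function `t()` swaps rows and columns (transpose). The chunk below is a brief sketch that rebuilds `M2` so it is self-contained.

```{r extra-matrix-means, echo = TRUE}
a <- c(2, 4, 8, 16, 32)
b <- c(1, 2, 4, 8, 16)
M2 <- cbind(a, b)
# Mean of each column and of each row
colMeans(M2)
rowMeans(M2)
# Swap rows and columns with t()
t(M2)
# Arithmetic between two matrices of the same size is element by element
M2 + M2
```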
## Data frames
A data frame is a two-dimensional (rows and columns) data structure with the variables set as columns and the observations as rows. Unlike matrices, data frames can store a different data type in each column.
As an example, we will use a dataset from R called ChickWeight. The ChickWeight data frame has 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks.
The body weights of the chicks were measured at birth and every second day thereafter until day 20. They were also measured on day 21. There were four groups of chicks on different protein diets.
```{r dataframe, echo = TRUE}
chicks_df <- datasets::ChickWeight
class(chicks_df)
```
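A data frame can also be built by hand with the base R function `data.frame()`, combining vectors of the same length but of different data types. The sketch below reuses the kind of values from the vector examples; the data frame name, the column names and the values are made up for illustration and are not real data.

```{r extra-dataframe-create, echo = TRUE}
# Made-up illustrative values, not real data
countries_df <- data.frame(
  country = c("Kenya", "Tanzania", "Uganda", "Rwanda"),
  value = c(3.5, -2.1, 5.3, 2.6),
  count = c(4L, 5L, 2L, 1L),
  visited = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE # keep text as character in all R versions
)
countries_df
# Each column keeps its own data type
sapply(countries_df, class)
# Number of rows (observations) and columns (variables)
nrow(countries_df)
ncol(countries_df)
```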
## Data frames | Overview
You can print the first observations in the data frame using the function **head()**
```{r dataframe2, echo = TRUE}
head(chicks_df)
```
You can also get a rapid overview of your data using **str()**
```{r dataframe str, echo = TRUE}
str(chicks_df)
```
## Factors
Factors are variables that take a **limited** number of values or categories. Such variables are typically called *categorical variables*.
```{r fctr, echo = TRUE}
class(chicks_df$Chick)
class(chicks_df$Diet)
```
## Factors | Levels
```{r levels, echo = TRUE}
levels(chicks_df$Diet)
```
Diet: a factor with levels 1, ..., 4 indicating which experimental diet the chick received.
```{r levels2, echo = TRUE}
levels(chicks_df$Chick)
```
Chick: an ordered factor with levels 18 < ... < 48 giving a unique identifier for the chick. The ordering of the levels groups chicks on the same diet together and orders them according to their final weight (lightest to heaviest) within diet.
## Factors | Summary
How many observations do we have under each level of a factor?
```{r levels3, echo = TRUE}
summary(chicks_df$Chick)
summary(chicks_df$Diet)
```
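Factors can also be created from a character vector with the base R function `factor()`. The following sketch uses made-up soil texture labels; the variable names are chosen only for illustration.

```{r extra-factor-create, echo = TRUE}
# Made-up illustrative values
soil_txt <- c("clay", "sand", "loam", "sand", "clay", "sand")
class(soil_txt)
# Convert the character vector into a factor
soil_fct <- factor(soil_txt)
class(soil_fct)
# Levels are sorted alphabetically by default
levels(soil_fct)
# Number of observations per level
summary(soil_fct)
# The levels can also be given in a chosen order
soil_fct2 <- factor(soil_txt, levels = c("sand", "loam", "clay"))
levels(soil_fct2)
```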
## Data frames | Selection of elements
Similarly to matrices, you can select element(s) from a data frame indicating the indices of these elements like this: [row(s), column(s)]
```{r dataframe select, echo = TRUE}
chicks_df[1,] #selects all the elements from the first row
```
```{r dataframe select2, echo = TRUE}
chicks_df[1:3, 1] #selects the first column (weight) from rows 1 to 3
```
But it can be easier using the name of the column (or variable)...
```{r dataframe select3, echo = TRUE}
chicks_df[1:3, "weight"] #selects the first column (weight) from rows 1 to 3
```
## Data frames | Selecting an entire column
```{r dataframe selectcol, echo = TRUE}
chicks_df[, 1] #selects the entire first column (weight)
chicks_df[, "weight"] #selects the entire weight column by name
```
## Data frames | Selecting an entire column with $
You can also use $ as a shortcut...
```{r dataframe selectcol2, echo = TRUE}
chicks_df$weight
```
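Rows of a data frame can be selected by comparison, in the same way as vector elements: a logical condition goes in the row position of the square brackets. The chunk below is a sketch on the ChickWeight data; `diet1_day0` is an illustrative name.

```{r extra-dataframe-filter, echo = TRUE}
chicks_df <- datasets::ChickWeight
# Rows for chicks on diet 1 measured on day 0
diet1_day0 <- chicks_df[chicks_df$Diet == "1" & chicks_df$Time == 0, ]
head(diet1_day0)
# Several columns at once, selected by name
head(chicks_df[, c("weight", "Time")])
# Mean weight of all chicks measured on day 21
mean(chicks_df$weight[chicks_df$Time == 21])
```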
## Importing data into R | read.table()
The function **read.table()** is the most basic importing function in R.
```{r read.table, echo = TRUE}
hotdogs_df <- read.table(file = "Data/hotdogs.txt", #the file is located in the
#Data folder within your working directory
header = FALSE, #there are no column names in the first
#row
sep = "\t", #field separator character. Here fields are
#delimited by tabs
col.names = c("type", "calories", "sodium"))
head(hotdogs_df)
```
## Importing data into R | read.table()
```{r read.table2, echo = TRUE}
swimming_df <- read.table(file = "Data/swimming_pools.csv", #the file is located
#in the Data folder within your working directory
header = TRUE, #the first row contains the column
#names
sep = ",") #field separator character. Here fields are
#delimited by commas (.csv)
head(swimming_df)
```
## Importing data into R | read.csv()
The function **read.csv()** works like **read.table()**, but the argument *header* is set to TRUE and *sep* is set to "," by default.
```{r read.csv, echo = TRUE}
swimming_df <- read.csv(file = "Data/swimming_pools.csv")
class(swimming_df)
head(swimming_df)
```
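The relationship between the two functions can be checked without the seminar data files. The sketch below writes a tiny comma-separated file to a temporary location (the file content is made up, and `tmp_csv`, `demo1` and `demo2` are illustrative names) and reads it back with both functions.

```{r extra-import-tempfile, echo = TRUE}
# Write a tiny comma-separated file to a temporary location
# (made-up values, only to illustrate the arguments)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("site,depth,lit",
             "A,1.5,TRUE",
             "B,2.0,FALSE"),
           con = tmp_csv)
# read.table() needs to be told about the header and the separator
demo1 <- read.table(file = tmp_csv, header = TRUE, sep = ",")
# read.csv() assumes both by default
demo2 <- read.csv(file = tmp_csv)
demo1
# Compare the two results
all.equal(demo1, demo2)
# Data type of each imported column
str(demo2)
```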
## Importing data into R | read_csv
**read_csv()** is a function from the package **readr** that helps to import data easily into R
```{r read_csv, echo = TRUE, warning = FALSE, message = FALSE}
library(readr)
swimming_df <- read_csv(file = "Data/swimming_pools.csv")
class(swimming_df)
head(swimming_df)
```
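One of the main differences is that, in R versions before 4.0.0, read.csv() converts strings to factors by default (*stringsAsFactors* = TRUE), whereas read_csv() always keeps them as characters. read_csv() also returns a tibble instead of a plain data frame.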
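The effect of converting strings to factors can be seen with base R alone. The sketch below passes made-up CSV content through the `text` argument of read.csv() and sets `stringsAsFactors` explicitly, so the result does not depend on the R version; `csv_text`, `as_chr` and `as_fct` are illustrative names.

```{r extra-strings-factors, echo = TRUE}
# Made-up CSV content supplied as text instead of a file
csv_text <- "site,depth\nA,1.5\nB,2.0"
# Keep text columns as character
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_chr$site)
# Convert text columns to factors
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_fct$site)
levels(as_fct$site)
```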
## Importing data into R | read_excel()
Before importing data from an Excel workbook into R you need to check which sheets are available in the workbook.
You can do this using the function **excel_sheets()** from the package **readxl**.
```{r read_excel, echo = TRUE}
library(readxl)
excel_sheets(path = "Data/vanGeldern_run1.xlsx")
```
## Importing data into R | read_excel()
```{r read_excel2, echo = TRUE}
Picarro_run1 <- read_excel(path = "Data/vanGeldern_run1.xlsx",
sheet = "final values LIMS")
head(Picarro_run1)
class(Picarro_run1)
```