Data handling and curation - from raw to clean data

Data handling & curation

Managing the “data life-cycle” within and beyond a project.

  • creating data
  • organising data
  • maintaining data

Data cleaning vs Data wrangling

Data cleaning is the process of removing incorrect, duplicate, or otherwise erroneous data from a dataset

Data wrangling is the process of changing the format of a dataset to make it more useful for your analysis
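A minimal sketch of the difference, using a made-up tibble (obs is invented for illustration): cleaning removes or repairs bad records, wrangling reshapes the good ones.

library(tidyverse)

# hypothetical field data with a duplicated row and a -999 "missing" flag
obs <- tibble(
  site  = c("A", "A", "B", "B"),
  year  = c(2020, 2020, 2020, 2021),
  count = c(12, 12, -999, 7)
)

# data cleaning: drop the duplicate row and recode -999 to NA
obs_clean <- obs |>
  distinct() |>
  mutate(count = na_if(count, -999))

# data wrangling: reshape to one column per year
obs_wide <- obs_clean |>
  pivot_wider(names_from = year, values_from = count)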

Some R functions that help with data cleaning

{janitor}

library(tidyverse) 
library(palmerpenguins) 
library(janitor)  
palmerpenguins::penguins_raw |>  names() 
 [1] "studyName"           "Sample Number"       "Species"            
 [4] "Region"              "Island"              "Stage"              
 [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
[10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
[13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
[16] "Delta 13 C (o/oo)"   "Comments"           
janitor::clean_names(palmerpenguins::penguins_raw) |>  names()
 [1] "study_name"        "sample_number"     "species"          
 [4] "region"            "island"            "stage"            
 [7] "individual_id"     "clutch_completion" "date_egg"         
[10] "culmen_length_mm"  "culmen_depth_mm"   "flipper_length_mm"
[13] "body_mass_g"       "sex"               "delta_15_n_o_oo"  
[16] "delta_13_c_o_oo"   "comments"         

{dplyr}

clean_penguins <- janitor::clean_names(palmerpenguins::penguins_raw)

dplyr::glimpse(clean_penguins)
Rows: 344
Columns: 17
$ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
$ delta_15_n_o_oo   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,…
$ delta_13_c_o_oo   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, …
$ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…
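glimpse() is only the inspection step; a typical follow-up clean-up with {dplyr} and {stringr} might look something like this (a sketch, not part of the original slides):

clean_penguins |>
  dplyr::mutate(
    species = stringr::word(species, 1),            # keep just the first word, e.g. "Adelie"
    sex = stringr::str_to_lower(sex),               # "MALE" -> "male"
    clutch_completion = clutch_completion == "Yes"  # recode Yes/No to TRUE/FALSE
  )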

{skimr}

out <- skimr::skim(palmerpenguins::penguins_raw)

out
Data summary
Name palmerpenguins::penguins_…
Number of rows 344
Number of columns 17
_______________________
Column type frequency:
character 9
Date 1
numeric 7
________________________
Group variables None

Variable type: character

|skim_variable     | n_missing| complete_rate| min| max| empty| n_unique| whitespace|
|:-----------------|---------:|-------------:|---:|---:|-----:|--------:|----------:|
|studyName         |         0|          1.00|   7|   7|     0|        3|          0|
|Species           |         0|          1.00|  33|  41|     0|        3|          0|
|Region            |         0|          1.00|   6|   6|     0|        1|          0|
|Island            |         0|          1.00|   5|   9|     0|        3|          0|
|Stage             |         0|          1.00|  18|  18|     0|        1|          0|
|Individual ID     |         0|          1.00|   4|   6|     0|      190|          0|
|Clutch Completion |         0|          1.00|   2|   3|     0|        2|          0|
|Sex               |        11|          0.97|   4|   6|     0|        2|          0|
|Comments          |       290|          0.16|  18|  68|     0|       10|          0|

Variable type: Date

|skim_variable | n_missing| complete_rate|min        |max        |median     | n_unique|
|:-------------|---------:|-------------:|:----------|:----------|:----------|--------:|
|Date Egg      |         0|             1|2007-11-09 |2009-12-01 |2008-11-09 |       50|

Variable type: numeric

|skim_variable       | n_missing| complete_rate|    mean|     sd|      p0|     p25|     p50|     p75|    p100|hist  |
|:-------------------|---------:|-------------:|-------:|------:|-------:|-------:|-------:|-------:|-------:|:-----|
|Sample Number       |         0|          1.00|   63.15|  40.43|    1.00|   29.00|   58.00|   95.25|  152.00|▇▇▆▅▃ |
|Culmen Length (mm)  |         2|          0.99|   43.92|   5.46|   32.10|   39.23|   44.45|   48.50|   59.60|▃▇▇▆▁ |
|Culmen Depth (mm)   |         2|          0.99|   17.15|   1.97|   13.10|   15.60|   17.30|   18.70|   21.50|▅▅▇▇▂ |
|Flipper Length (mm) |         2|          0.99|  200.92|  14.06|  172.00|  190.00|  197.00|  213.00|  231.00|▂▇▃▅▂ |
|Body Mass (g)       |         2|          0.99| 4201.75| 801.95| 2700.00| 3550.00| 4050.00| 4750.00| 6300.00|▃▇▆▃▂ |
|Delta 15 N (o/oo)   |        14|          0.96|    8.73|   0.55|    7.63|    8.30|    8.65|    9.17|   10.03|▃▇▆▅▂ |
|Delta 13 C (o/oo)   |        13|          0.96|  -25.69|   0.79|  -27.02|  -26.32|  -25.83|  -25.06|  -23.79|▆▇▅▅▂ |
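skim() also accepts column selections if you only want part of the summary (an aside, not in the original slides):

palmerpenguins::penguins_raw |>
  skimr::skim(`Body Mass (g)`, `Flipper Length (mm)`)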

{dplyr}

# Deduplication
data <- tibble(
  Day = c("Monday", "Tuesday", "Wednesday", "Wednesday"),
  Person = c("Becks", "Amy", "Matt", "Matt")
)
data
# A tibble: 4 × 2
  Day       Person
  <chr>     <chr> 
1 Monday    Becks 
2 Tuesday   Amy   
3 Wednesday Matt  
4 Wednesday Matt  
data |> dplyr::distinct()
# A tibble: 3 × 2
  Day       Person
  <chr>     <chr> 
1 Monday    Becks 
2 Tuesday   Amy   
3 Wednesday Matt  
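distinct() can also deduplicate on a subset of columns while keeping the remaining ones (a small sketch):

data |> dplyr::distinct(Day, .keep_all = TRUE)  # keeps the first record for each Day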

{naniar}

# creating some data with missing values
missing_penguins <- missMethods::delete_MCAR(clean_penguins, 0.3)  # delete 30% of values completely at random

missing_penguins |>
  naniar::vis_miss()
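Besides the vis_miss() overview plot, {naniar} also provides numeric summaries of missingness, for example:

missing_penguins |>
  naniar::miss_var_summary()   # number and percentage of NAs per variable

missing_penguins |>
  naniar::gg_miss_var()        # the same information as a plot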

Missing data

Missing data is normally a problem. Typically as ecologists we sweep missing data under the carpet by using a “complete case” approach to data analysis.

If you have ever written some code like this:

# na.omit() 
df <- na.omit(df)  
# complete.cases() 
df <- df[complete.cases(df), ]   
# rowSums() 
df <- df[rowSums(is.na(df)) == 0, ]   
# drop_na() 
df <- df  |>  tidyr::drop_na()

you are removing missing data (NAs) from your dataset.

Why is this a problem?

By throwing away potentially useful data (only including those rows without an NA in them) you reduce the information you are working with, reduce statistical power, and introduce selection bias (invalidating any assumption of randomisation).

Different types of missingness

There are three broad categories of missing data:

  • MCAR (missing completely at random) - missingness is not related to any measured or unmeasured variables

  • MAR (missing at random) - missingness is not random, but it is related to other variables and can be accounted for by another, completely observed, variable

  • MNAR (missing not at random) - missingness is related to the missing values themselves (there is a systematic reason why the data are missing within a particular variable)

Imagine that we are measuring rainfall at weather stations across Norway every year.

Missing data patterns - MCAR

Table 1 (a): Complete data

|station number | rainfall|
|--------------:|--------:|
|             1 |       30|
|             2 |      150|
|             3 |       75|
|             4 |      250|
|             5 |       55|

Table 1 (b): MCAR

|station number | rainfall|
|--------------:|--------:|
|             1 |       30|
|             2 |         |
|             3 |         |
|             4 |      250|
|             5 |       55|

Missing data patterns - MAR

Table 2: Complete data

|station number | rainfall|
|--------------:|--------:|
|             1 |       30|
|             2 |      150|
|             3 |       75|
|             4 |      250|
|             5 |       55|

Table 3: MAR

|station number | rainfall|
|--------------:|--------:|
|             1 |         |
|             2 |         |
|             3 |         |
|             4 |      250|
|             5 |       55|

Missing data patterns - MNAR

Table 4: Complete data

|station number | rainfall|
|--------------:|--------:|
|             1 |       30|
|             2 |      150|
|             3 |       75|
|             4 |      250|
|             5 |       55|

Table 5: MNAR

|station number | rainfall|
|--------------:|--------:|
|             1 |       30|
|             2 |         |
|             3 |       75|
|             4 |         |
|             5 |       55|

What effect does missingness have?

library(tidyverse, quietly = TRUE) 
library(missMethods, quietly = TRUE) 
library(palmerpenguins)

# create datasets with different levels of missingness

penguins_complete <- penguins |>
  drop_na()

# delete 30% of flipper_length_mm values completely at random
miss_penguins_MCAR <- missMethods::delete_MCAR(
  penguins_complete, 0.3, "flipper_length_mm"
)

# create a pattern of missingness with censoring
# (missing value in flipper_length_mm
# if body_mass_g is below its 30% quantile)
miss_penguins_MAR <- missMethods::delete_MAR_censoring(
  penguins_complete, 0.3, "flipper_length_mm", "body_mass_g"
)

# create a pattern of missingness with censoring
# (missing value in flipper_length_mm
# if flipper_length_mm is below its 30% quantile)
miss_penguins_MNAR <- missMethods::delete_MNAR_censoring(
  penguins_complete, 0.3, "flipper_length_mm"
)

all_data <- bind_rows("Full" = penguins_complete,
                      "MCAR" = miss_penguins_MCAR,
                      "MAR"  = miss_penguins_MAR,
                      "MNAR" = miss_penguins_MNAR,
                      .id = "Missingness")
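The original slides presumably showed a figure at this point; as a sketch, you can see the effect by comparing the flipper-length estimates across the four versions of the data:

all_data |>
  group_by(Missingness) |>
  summarise(
    n_obs        = sum(!is.na(flipper_length_mm)),
    mean_flipper = mean(flipper_length_mm, na.rm = TRUE),
    sd_flipper   = sd(flipper_length_mm, na.rm = TRUE)
  )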

What can we do about missing data?

With MCAR and MAR we can use multiple imputation techniques

# Load mice
library(mice, quietly = TRUE)

# Set seed for reproducibility
set.seed(123)

# Simulate data: location, year, count
locations <- rep(1:5, each = 5)         # 5 locations
years <- rep(2009:2023, 5)              # 15 years of data (2009-2023), repeated for the 5 locations
count <- round(rpois(25, lambda = 20))  # simulated count data

# Create a dataframe (the shorter vectors are recycled to 75 rows)
data <- data.frame(Location = locations, Year = years, Count = count)

# Introduce missingness - Missing completely at random (MCAR)
prop_missing <- 0.2  # Example: 20% missingness
missing_indices <- sample(1:nrow(data), prop_missing * nrow(data))
data$Count[missing_indices] <- NA

# Check the structure of the data
str(data)
'data.frame':   75 obs. of  3 variables:
 $ Location: int  1 1 1 1 1 2 2 2 2 2 ...
 $ Year    : int  2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 ...
 $ Count   : num  17 25 12 20 27 NA 14 NA 25 21 ...

When we impute the missing data we create five imputed datasets (m = 5); each one can then be analysed separately and the results pooled (averaged) across the five datasets.

# Impute missing data using MICE
imp <- mice(data, m = 5, method = "pmm", seed = 500)

 iter imp variable
  1   1  Count
  1   2  Count
  1   3  Count
  1   4  Count
  1   5  Count
  2   1  Count
  2   2  Count
  2   3  Count
  2   4  Count
  2   5  Count
  3   1  Count
  3   2  Count
  3   3  Count
  3   4  Count
  3   5  Count
  4   1  Count
  4   2  Count
  4   3  Count
  4   4  Count
  4   5  Count
  5   1  Count
  5   2  Count
  5   3  Count
  5   4  Count
  5   5  Count
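To use the imputations you either extract a completed dataset with complete(), or fit your model to each imputed dataset and pool the results (a minimal sketch):

# the first of the five completed datasets
completed_1 <- mice::complete(imp, 1)

# or: fit the same model to all five imputed datasets and pool the estimates
fits <- with(imp, lm(Count ~ Year + Location))
summary(mice::pool(fits))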

sem() instead of lm()

Using a Structural Equation Model (SEM) we can run the same simple linear regression while still making use of incomplete rows.

library(lavaan)
# we will use the iris dataset for this example
complete_model <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length, data = iris)

iris_MCAR <- missMethods::delete_MCAR(iris, 0.3, "Sepal.Width")

miss_model1 <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length, data = iris_MCAR)
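The side-by-side regression table below looks like {stargazer} text output; a call along these lines would produce it (an assumption on my part, the call is not shown in the original):

# assumed call, not in the original slides
stargazer::stargazer(complete_model, miss_model1, type = "text",
                     star.cutoffs = c(0.05, 0.01, 0.001))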

=====================================================================
                                   Dependent variable:               
                    -------------------------------------------------
                                       Petal.Width                   
                              (1)                      (2)           
---------------------------------------------------------------------
Sepal.Length               -0.207***                 -0.187**        
                            (0.048)                  (0.057)         
Sepal.Width                 0.223***                 0.219***        
                            (0.049)                  (0.057)         
Petal.Length                0.524***                 0.516***        
                            (0.024)                  (0.029)         
Constant                     -0.240                   -0.315         
                            (0.178)                  (0.215)         
---------------------------------------------------------------------
Observations                  150                      105           
R2                           0.938                    0.931          
Adjusted R2                  0.937                    0.929          
Residual Std. Error     0.192 (df = 146)         0.204 (df = 101)    
F Statistic         734.389*** (df = 3; 146) 454.577*** (df = 3; 101)
=====================================================================
Note:                                   *p<0.05; **p<0.01; ***p<0.001

miss_model2 <- sem("Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length", data = iris_MCAR, missing = "ML")
summary(miss_model2)
lavaan 0.6-18 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

                                                  Used       Total
  Number of observations                           105         150
  Number of missing patterns                         1            

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  Petal.Width ~                                       
    Sepal.Length     -0.187    0.056   -3.344    0.001
    Sepal.Width       0.219    0.056    3.945    0.000
    Petal.Length      0.516    0.028   18.133    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .Petal.Width      -0.315    0.211   -1.493    0.135

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .Petal.Width       0.040    0.006    7.246    0.000

Adding the argument missing = "ML" to the sem() function uses full information maximum likelihood (FIML): a likelihood is evaluated for each row based on the variables that are present, so all the available data are used.

Data validation

library(data.validator)  
report <- data_validation_report()  

between <- function(a, b) {
  function(x) { a <= x & x <= b }
}

validate(iris, name = "Verifying flower dataset") |> 
  validate_if(Sepal.Length > 0, description = "Sepal length is greater than 0") |> 
  validate_cols(between(0, 4), Sepal.Width, description = "Sepal width is between 0 and 4") |> 
  add_results(report)

print(report)
Validation summary: 
 Number of successful validations: 1
 Number of validations with warnings: 0
 Number of failed validations: 1

Advanced view: 


|table_name               |description                    |type    | total_violations|
|:------------------------|:------------------------------|:-------|----------------:|
|Verifying flower dataset |Sepal length is greater than 0 |success |               NA|
|Verifying flower dataset |Sepal width is between 0 and 4 |error   |                3|
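{data.validator} can also write the report to a standalone HTML file for sharing (a sketch, assuming the package's save_report() helper):

data.validator::save_report(report, output_file = "validation_report.html")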


Messy data

Look at the messy_data dataset below. How would you go about cleaning it? (One possible starting point is sketched after the printout.)

messy_data <- readRDS(here::here("OS_2024/dataHandling/data/messy_data.RDS")) 

messy_data |> data.table::data.table() 
           Date    SEX      age    Species Name       lat      lon count
         <char> <char>   <char>          <char>     <num>    <num> <int>
  1: 2023/02/01   male    Adult    Lagopus matu  9.123826 62.92628     5
  2:   02/02/23 Female Juvenile    Logopus muta  7.836517 61.70676     9
  3: 2023/02/01   male        J    Logopus muta  9.387262 63.84034    15
  4: 01-01-2023 female    ADULT Lagopus lagopus  9.609423 61.06334    14
  5: 01-01-2023   Male    ADULT    Lagopus matu  8.531654 63.44823    10
  6: 2023-02-01   Male        A    Lagopus muta 10.369053 62.60819    10
  7:   02/02/23 FEMALE JUVENILE    Lagopus muta  9.297162 62.71758    11
  8:   02-01-23 FEMALE    Adult Lagopus lagopus  6.479778 61.59151    15
  9: 2023-02-01 female Juvenile    Lagopus muta  7.879991 63.10879     4
 10:   02/02/23   MALE JUVENILE    Lagopus muta  6.794386 62.41039    14
 11: 2023/02/01 Female    ADULT    Lagopus matu 10.075820 62.14845     6
 12:   02-01-23   MALE        J Lagopas lagopus  9.368355 64.29196    13
 13: 2023/02/01 Female    ADULT    Lagopus matu 10.933222 63.08134     8
 14: 01-01-2023   Male        A    Logopus muta  9.508031 61.63579     7
 15:   02/02/23   MALE JUVENILE    Logopus muta 10.925062 62.59583     4
 16:   02/02/23 female        A Lagopus lagopus  9.279532 62.91511    11
 17:   02/02/23   male        J Lagopus lagopus  9.139805 65.96236    10
 18: 2023-02-01   male Juvenile Lagopas lagopus 10.119368 62.39573    15
 19:   02-01-23   MALE    Adult    Lagopus matu  7.847859 64.10893    10
 20: 2023-02-01   male JUVENILE    Lagopus muta 10.732284 64.42712    13
 21: 2023/02/01 Female        A Lagopas lagopus  8.541997 62.14247    14
 22:   02/02/23   male        J    Logopus muta  9.220419 62.62880    13
 23:   02-01-23 female JUVENILE Lagopus lagopus  9.624997 63.91689    13
 24: 01-01-2023 female    Adult Lagopas lagopus  8.769459 62.61416     9
 25: 01-01-2023 FEMALE    ADULT    Lagopus matu  9.011462 64.40237     2
 26:   02-01-23   MALE    ADULT Lagopas lagopus  8.343305 63.35766    19
 27: 2023/02/01   Male JUVENILE Lagopus lagopus  7.015809 62.97256    10
 28: 2023-02-01   MALE    ADULT Lagopas lagopus  7.249466 63.64190     5
 29:   02/02/23 FEMALE    Adult Lagopas lagopus  9.762441 61.59733     6
 30: 2023/02/01 Female Juvenile Lagopas lagopus  9.081167 62.09558     6
 31: 2023/02/01   MALE JUVENILE    Logopus muta  8.544149 64.01272    15
 32: 2023/02/01 FEMALE JUVENILE Lagopas lagopus  9.115873 64.69849     9
 33: 01-01-2023   male    ADULT    Lagopus matu  7.281479 63.76351    12
 34: 2023/02/01 female JUVENILE    Lagopus muta 10.688551 63.98931    15
 35:   02-01-23   Male Juvenile Lagopus lagopus  9.927145 62.94457     7
 36:   02-01-23   male Juvenile    Logopus muta  8.498186 62.40878     9
 37:   02-01-23 female Juvenile    Lagopus matu  8.993657 64.06921     0
 38: 2023-02-01 female    ADULT    Logopus muta  8.400518 63.67579     3
 39: 2023/02/01   Male    ADULT    Lagopus matu  8.671143 63.79903    18
 40: 01-01-2023 female        J    Lagopus matu  9.794615 61.83442     5
 41: 2023/02/01   MALE        A    Logopus muta  7.662123 63.20559    18
 42: 01-01-2023   male        A    Logopus muta  7.971045 64.32591    15
 43: 01-01-2023 female        J    Lagopus matu  8.777006 62.43949    11
 44:   02/02/23   Male Juvenile    Lagopus muta  8.228000 62.71882    16
 45: 01-01-2023 Female        J    Lagopus muta  8.912136 61.42190     5
 46: 01-01-2023   MALE Juvenile    Lagopus muta 10.286843 62.89537     7
 47: 2023/02/01   male    Adult    Logopus muta  9.195381 61.35946     6
 48:   02-01-23 Female        J    Lagopus matu  8.801430 62.92860     2
 49:   02/02/23 female        J Lagopas lagopus 11.148595 63.51963    10
 50:   02/02/23 female    ADULT Lagopas lagopus  7.861651 62.37019    12
 51: 2023-02-01 female        A    Lagopus muta  9.238828 62.88834     5
 52: 2023/02/01 female JUVENILE    Logopus muta  9.126381 63.92296    13
 53:   02-01-23 female Juvenile Lagopas lagopus 10.303237 62.86576     5
 54: 01-01-2023 female        A Lagopas lagopus  8.745229 62.17313    18
 55:   02-01-23 female        A    Logopus muta  8.691021 62.10421     6
 56:   02/02/23 female JUVENILE Lagopas lagopus  9.314176 62.15080    15
 57: 01-01-2023 Female Juvenile    Logopus muta  7.122930 62.32509    11
 58: 2023-02-01 Female        A Lagopas lagopus  9.117661 61.81606    18
 59: 01-01-2023   male        A Lagopas lagopus  9.800076 63.52459     3
 60: 2023/02/01 Female JUVENILE    Lagopus matu  9.263243 63.46659    12
 61:   02-01-23 Female    Adult Lagopas lagopus  9.432039 61.53335     7
 62: 2023-02-01   Male    Adult    Lagopus matu  7.910848 62.92433     1
 63:   02/02/23   Male        J    Logopus muta  9.092678 61.07257    13
 64: 2023/02/01   male        A Lagopus lagopus 10.627507 62.46044     9
 65: 2023/02/01   MALE        A    Lagopus matu  8.575023 62.45981    13
 66: 01-01-2023 female JUVENILE Lagopas lagopus 10.736562 64.37288     5
 67: 01-01-2023   MALE    ADULT Lagopas lagopus 10.644320 64.10796    14
 68:   02-01-23   MALE    ADULT Lagopus lagopus  8.442710 65.27996    12
 69:   02-01-23 FEMALE Juvenile Lagopas lagopus  7.970893 62.21394     6
 70:   02/02/23   male Juvenile    Lagopus muta  7.002591 62.35749     7
 71: 01-01-2023 FEMALE    ADULT Lagopus lagopus  9.516319 63.04347     6
 72: 01-01-2023 Female        A    Lagopus muta  8.313179 64.23023    12
 73: 2023-02-01 female        A Lagopas lagopus  7.678440 63.40045     6
 74:   02-01-23   male    ADULT    Lagopus matu  8.729332 60.78910     5
 75: 01-01-2023 Female    Adult    Logopus muta  9.753151 63.08899     8
 76:   02/02/23 FEMALE    Adult    Lagopus matu  7.873569 62.33233     8
 77: 01-01-2023   male Juvenile Lagopas lagopus  8.963270 62.15332     9
 78:   02/02/23   male    ADULT Lagopus lagopus  8.428166 64.55505     4
 79:   02/02/23 Female        A Lagopas lagopus  9.471109 61.53559    11
 80: 01-01-2023   MALE        J    Lagopus matu  9.236367 62.27483    12
 81: 2023-02-01 female    ADULT    Lagopus muta  8.939548 63.26092     3
 82:   02-01-23 female    ADULT    Logopus muta  9.227044 61.22400     5
 83: 2023-02-01 female JUVENILE    Lagopus muta  9.318695 62.20958    11
 84:   02/02/23 female Juvenile    Lagopus matu  8.708364 63.88290     7
 85:   02/02/23 female JUVENILE Lagopas lagopus  8.872903 63.71204     8
 86: 2023/02/01 female    Adult    Lagopus matu  7.703743 63.25247     1
 87: 2023-02-01 Female        A    Lagopus matu  9.127351 62.30380    11
 88: 2023-02-01 Female    Adult    Lagopus matu  9.523766 60.64250     8
 89: 2023/02/01 female        J Lagopas lagopus  9.181112 61.92349    12
 90:   02/02/23   MALE Juvenile Lagopas lagopus  7.702984 61.91841     4
 91:   02-01-23 female        J    Logopus muta  8.141424 63.80680     9
 92:   02-01-23   male Juvenile    Lagopus matu 11.014120 64.89251    18
 93: 01-01-2023   MALE JUVENILE    Lagopus muta  6.444234 64.92426     7
 94: 01-01-2023   male        J    Logopus muta  8.507807 63.49817     5
 95:   02-01-23 Female Juvenile    Lagopus matu  9.379068 62.59566     8
 96:   02-01-23   Male        J Lagopus lagopus  8.190334 61.91115    13
 97:   02-01-23   male        A    Lagopus matu  8.489750 63.48030     7
 98:   02/02/23 FEMALE    Adult    Lagopus muta 10.193635 62.98715    15
 99:   02/02/23   MALE        J    Logopus muta  8.798305 63.96787     5
100: 2023-02-01   MALE        A    Lagopus matu  7.846161 63.42758    14
           Date    SEX      age    Species Name       lat      lon count
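One possible starting point (only a sketch; there are many valid approaches, and the Date and Species Name columns need particular care):

library(tidyverse)

messy_data |>
  janitor::clean_names() |>                     # consistent, snake_case column names
  mutate(
    sex = str_to_lower(sex),                    # "MALE", "Male", "male" -> "male"
    age = case_when(
      str_starts(str_to_lower(age), "a") ~ "adult",
      str_starts(str_to_lower(age), "j") ~ "juvenile"
    ),
    species_name = str_replace_all(
      species_name,
      c("Logopus" = "Lagopus",                  # fix genus typos
        "Lagopas" = "Lagopus",
        "matu$"   = "muta")                     # "Lagopus matu" -> "Lagopus muta"
    )
    # the date column mixes formats (2023/02/01, 2023-02-01, 02-01-23, 01-01-2023)
    # and needs parsing format-by-format, e.g. with lubridate::parse_date_time();
    # lat and lon also look swapped (lat runs 6-11, lon 60-66, the reverse of Norway)
  )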

Data wrangling

Transform to long format

library(tidyverse) 
# this avoids the conflict between dplyr::filter and stats::filter
conflicted::conflict_prefer("filter", "dplyr") 
# Pivot longer  
penguins_long <- penguins |>
  pivot_longer(contains("_"),
               names_to = c("part", "measure", "unit"),
               names_sep = "_")

penguins_long 
# A tibble: 1,376 × 8
   species island    sex     year part    measure unit   value
   <fct>   <fct>     <fct>  <int> <chr>   <chr>   <chr>  <dbl>
 1 Adelie  Torgersen male    2007 bill    length  mm      39.1
 2 Adelie  Torgersen male    2007 bill    depth   mm      18.7
 3 Adelie  Torgersen male    2007 flipper length  mm     181  
 4 Adelie  Torgersen male    2007 body    mass    g     3750  
 5 Adelie  Torgersen female  2007 bill    length  mm      39.5
 6 Adelie  Torgersen female  2007 bill    depth   mm      17.4
 7 Adelie  Torgersen female  2007 flipper length  mm     186  
 8 Adelie  Torgersen female  2007 body    mass    g     3800  
 9 Adelie  Torgersen female  2007 bill    length  mm      40.3
10 Adelie  Torgersen female  2007 bill    depth   mm      18  
# ℹ 1,366 more rows

Transform to wide format

penguins_long |>
  pivot_wider(names_from = species, values_from = value) 
# A tibble: 92 × 9
   island    sex     year part    measure unit  Adelie    Gentoo Chinstrap
   <fct>     <fct>  <int> <chr>   <chr>   <chr> <list>    <list> <list>   
 1 Torgersen male    2007 bill    length  mm    <dbl [7]> <NULL> <NULL>   
 2 Torgersen male    2007 bill    depth   mm    <dbl [7]> <NULL> <NULL>   
 3 Torgersen male    2007 flipper length  mm    <dbl [7]> <NULL> <NULL>   
 4 Torgersen male    2007 body    mass    g     <dbl [7]> <NULL> <NULL>   
 5 Torgersen female  2007 bill    length  mm    <dbl [8]> <NULL> <NULL>   
 6 Torgersen female  2007 bill    depth   mm    <dbl [8]> <NULL> <NULL>   
 7 Torgersen female  2007 flipper length  mm    <dbl [8]> <NULL> <NULL>   
 8 Torgersen female  2007 body    mass    g     <dbl [8]> <NULL> <NULL>   
 9 Torgersen <NA>    2007 bill    length  mm    <dbl [5]> <NULL> <NULL>   
10 Torgersen <NA>    2007 bill    depth   mm    <dbl [5]> <NULL> <NULL>   
# ℹ 82 more rows

What’s going on?

There is no unique identifier for each observation, so R collects all the values that share a combination of the remaining columns into list-columns. To solve this we need a unique row ID.

penguins_long |>
  mutate(sample=row_number()) |>
  pivot_wider(names_from = species,values_from = value) 
# A tibble: 1,376 × 10
   island    sex     year part    measure unit  sample Adelie Gentoo Chinstrap
   <fct>     <fct>  <int> <chr>   <chr>   <chr>  <int>  <dbl>  <dbl>     <dbl>
 1 Torgersen male    2007 bill    length  mm         1   39.1     NA        NA
 2 Torgersen male    2007 bill    depth   mm         2   18.7     NA        NA
 3 Torgersen male    2007 flipper length  mm         3  181       NA        NA
 4 Torgersen male    2007 body    mass    g          4 3750       NA        NA
 5 Torgersen female  2007 bill    length  mm         5   39.5     NA        NA
 6 Torgersen female  2007 bill    depth   mm         6   17.4     NA        NA
 7 Torgersen female  2007 flipper length  mm         7  186       NA        NA
 8 Torgersen female  2007 body    mass    g          8 3800       NA        NA
 9 Torgersen female  2007 bill    length  mm         9   40.3     NA        NA
10 Torgersen female  2007 bill    depth   mm        10   18       NA        NA
# ℹ 1,366 more rows

Tidy data

Tidy data principles

Untidy data

Have a look at the smallGame dataset.

smallGame <- readRDS(here::here("OS_2024/dataHandling/data/smallGame.RDS"))

  • What format is it in (long or wide)?

  • How would you convert it to the other format?

  • Which format do you find easier to use? (there is no “correct” answer to this one!)

Code style

What are attributes of good code?

Have a look at the “notReproducible.qmd” file

Tidyverse style guide

“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.”

# Examples 
day_one # Good 
DayOne # Bad 
# Avoid names of common functions
T <- FALSE # Bad 
c <- 10 # Bad 
mean <- function(x) sum(x) # Bad 
# Space after a comma 
x[, 1] # Good 
x[,1] # Bad 
# No space between the function name and (), but a space before {
function(x) {} # Good
function (x) {} # Bad
function(x){} # Bad

Useful packages for style

library(lintr)
# Define the code as a character vector
code <- c(
  "# Spacing ",
  "average<-mean(feet/12 + inches,na.rm=TRUE)",
  "sqrt(x ^ 2 + y ^ 2)",
  "x <- 1 : 10",
  "base :: get",
  "# Indenting",
  "if (y < 0 && debug)",
  "  message('Y is negative')",
  "# Assignment",
  "x = 5"
)

# Write the code to 'bad_style.R'
writeLines(code, "bad_style.R")

# Run lintr on the newly created file
lint("bad_style.R")
bad_style.R:1:10: style: [trailing_whitespace_linter] Trailing whitespace is superfluous.
# Spacing 
         ^
bad_style.R:2:8: style: [infix_spaces_linter] Put spaces around all infix operators.
average<-mean(feet/12 + inches,na.rm=TRUE)
       ^~
bad_style.R:2:19: style: [infix_spaces_linter] Put spaces around all infix operators.
average<-mean(feet/12 + inches,na.rm=TRUE)
                  ^
bad_style.R:2:32: style: [commas_linter] Commas should always have a space after.
average<-mean(feet/12 + inches,na.rm=TRUE)
                               ^
bad_style.R:2:37: style: [infix_spaces_linter] Put spaces around all infix operators.
average<-mean(feet/12 + inches,na.rm=TRUE)
                                     ^
bad_style.R:8:11: style: [quotes_linter] Only use double-quotes.
  message('Y is negative')
          ^~~~~~~~~~~~~~~
bad_style.R:10:3: style: [assignment_linter] Use <-, not =, for assignment.
x = 5
  ^

library("styler") 
style_file("bad_style.R")
Styling  1  files:
 bad_style.R ℹ 
────────────────────────────────────────
Status  Count   Legend 
✔   0   File unchanged.
ℹ   1   File changed.
✖   0   Styling threw an error.
────────────────────────────────────────
Please review the changes carefully!
readLines("bad_style.R")
 [1] "# Spacing"                                        
 [2] "average <- mean(feet / 12 + inches, na.rm = TRUE)"
 [3] "sqrt(x^2 + y^2)"                                  
 [4] "x <- 1:10"                                        
 [5] "base::get"                                        
 [6] "# Indenting"                                      
 [7] "if (y < 0 && debug) {"                            
 [8] "  message(\"Y is negative\")"                     
 [9] "}"                                                
[10] "# Assignment"                                     
[11] "x <- 5"                                           

What I think…

  1. It runs (on your computer)

  2. It runs (on my computer - without me having to do anything/much)

  3. It does what you expect it to do (even after 5 years)

  4. It is documented in some way

What Jenny Bryan thinks

If the first line of your R script is

setwd("C:\Users\jenny\path\that\only\I\have")

I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.

If the first line of your R script is

rm(list = ls())

I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.