Glossary for Working with Data sets

This is a companion glossary to a previous post on working with large data sets. It highlights the functions and arguments most relevant to reading, cleaning, and using large data sets.

Loading the data file:
1. Set directory
  - Standard form
    - setwd("…")
  - Mac
    - setwd("/Users/UserName/Documents/FOLDER")
      GitHub Gist: set_dir_mac.R
  - Windows
    - setwd("C:/Users/UserName/Documents/FOLDER")
      GitHub Gist: set_dir_windows.R
  - Simplified
    - setwd("~/FOLDER")
      GitHub Gist: set_dir_simplified.R
      - "~" expands to the home directory: on Windows, R typically treats this as "C:/Users/UserName/Documents"; on a Mac it is "/Users/UserName"
  - Additional information
2. Read file
  - read table
    - read.table('file', header = TRUE, sep = " ")
      - Works with .txt and .csv
        
        ex: Temp <- read.table('Temp.csv', header=TRUE, sep = ",")
        GitHub Gist: read_table.R
  - read csv
    - read.csv('file')
      - works like read.table but defaults to header = TRUE and sep = ",", so it is suited to .csv files
        
        ex: Temp <- read.csv('Temp.csv', header=TRUE, sep = ",")
        GitHub Gist: read_csv.R
  - read excel file
    - read.xlsx('file', sheetName = 'sheet')
      - similar to read.csv but for .xlsx files
      - need to install and load a package that can read .xlsx files
        
        # install and load package to read xlsx files
        install.packages("xlsx")
        library(xlsx)
        GitHub Gist: xlsx.R
      - need to specify the sheet (tab) being read in; sep is not needed because an Excel file has no text delimiter
        
        ex: P <- read.xlsx('Precip_Basin.xlsx', startRow = 4, header = TRUE, sheetName = 'SUP_mm')
        GitHub Gist: read_xlsx.R
  - Choose file
    - read.csv(file.choose())
      - to choose files manually
  - Scan
    - scan('file', what = , sep = " ")
      - scan() should not be used for larger data sets
  - Additional information
  - Extensive descriptions of reading data
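
The relationship between read.table and read.csv above can be checked with a minimal sketch. The file contents and the column names (year, precip) are made up for illustration and are written to a temporary file so the example does not depend on any real data set.

```r
# Hypothetical sample data, written to a temporary file so the sketch is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("year,precip", "2001,810.5", "2002,NA", "2003,790.2"), tmp)

# read.table needs the header and separator spelled out
a <- read.table(tmp, header = TRUE, sep = ",")

# read.csv already defaults to header = TRUE and sep = ","
b <- read.csv(tmp)

identical(a, b)    # compares the two data frames
str(b)             # shows the column types R inferred

# shows what "~" expands to on this machine
path.expand("~")
```

For a simple file like this one the two calls should give the same data frame, so the choice is mostly about how much you want to type. The last line is a quick way to see where setwd("~/FOLDER") would actually point on your system.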
Cleaning up the data set:
1. Blank values
  - NA values
    - na.omit(data)
      - ex: new_data <- na.omit(data)
        GitHub Gist: na_omit.R
      - This creates a new data set identical to the original except that every row containing an NA value is removed
    - na.exclude(data)
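      - Works like na.omit(data): both return the object with rows containing NAs removed
      - The difference shows up in model fitting, where residuals and predictions are padded with NAs to match the original number of rows
    - na.fail(data)
      - Checks for NAs and returns the tested object if none are found; otherwise it stops with an error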
    - na.pass(data)
      - Passes over NAs to return the object unchanged
  - Additional information
  - More information with examples
2. Converting factors and numerics
  - as.character(data)
    - converts all factors to character strings
      - ex: P[,1]<-as.character(P[,1])
        GitHub Gist: character.R
    - Additional information
  - as.numeric(data)
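    - converts characters into a numeric vector; for a factor it returns the internal level codes, so convert with as.numeric(as.character(data)) to get the original values
    - same function as as.double (the older alias as.real is defunct in current versions of R)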
      - ex: P[,1]<-as.numeric(P[,1])
        GitHub Gist: numeric.R
    - Additional information
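
The clean-up functions above can be tried on a small example. This is only a sketch: the data frame d and the factor f are hypothetical, and the factor part is included because as.numeric on a factor returns level codes rather than the printed values.

```r
# Hypothetical data frame with one missing value
d <- data.frame(year = c(2001, 2002, 2003), precip = c(810.5, NA, 790.2))

# na.omit drops the 2002 row because precip is NA there
clean <- na.omit(d)
nrow(clean)

# na.fail stops with an error when NAs are present; catch it to keep the script running
res <- tryCatch(na.fail(d), error = function(e) "error: NAs present")

# na.pass returns the object unchanged, NAs included
same <- na.pass(d)

# Factor pitfall: the levels are sorted as text ("10" before "5"),
# and as.numeric returns the position of each value among those levels
f <- factor(c("10", "5", "10"))
codes  <- as.numeric(f)                 # level codes, not the values
values <- as.numeric(as.character(f))   # the values as numbers
```

Going through as.character first is the usual way to recover the numbers when a numeric column has been read in as a factor.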
Using the data:
1. Plotting
  - Basic plot function
    - plot(x,y)
  - Other useful arguments inside the plot function
    - type = " "
      - plot type
      - e.g. "l" for a line plot
    - main = ' '
      - main plot title
    - xlab = ' '
      - x-axis title
    - ylab = ' '
      - y-axis title
    - col = " "
      - line color
      - e.g. "blue"
    - lwd = <num>
      - line width/thickness
  - Arguments outside plot function
    - grid()
      - turns on grid lines for plot
    - par(new = TRUE/FALSE)
      - for plotting multiple lines or data sets on one plot
      - set to TRUE before the next plot() call to draw over the current plot; FALSE (the default) starts a fresh plot
  - ex:
    # plot Annual Precipitation
    plot(year, ann, type = "l", main = 'Annual Lake Superior Basin Precipitation', xlab = 'Year', ylab = 'Precipitation(mm)', col = "blue", lwd = 2)
    grid()
    par(new = F)
    GitHub Gist: plot.R
  - Additional information
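
As a rough sketch of putting two series on one plot, the example below uses lines() instead of par(new = TRUE). This is a swapped-in alternative, not the approach from the original post: lines() adds to the existing axes, whereas a second plot() after par(new = TRUE) draws its own axes, which may not match the first. The year and ann values are made up, and the plot is written to a temporary PDF so the code runs without a display.

```r
# Hypothetical data, made up for illustration only
year <- 2001:2010
ann  <- c(810, 795, 830, 760, 802, 845, 778, 790, 815, 800)
avg  <- rep(mean(ann), length(year))   # flat line at the mean of ann

out <- tempfile(fileext = ".pdf")
pdf(out)                               # draw to a file rather than the screen

plot(year, ann, type = "l", main = "Annual precipitation (made-up data)",
     xlab = "Year", ylab = "Precipitation (mm)", col = "blue", lwd = 2)
grid()                                 # add grid lines to the current plot
lines(year, avg, col = "red", lwd = 2) # second series on the same axes

dev.off()                              # close the device so the file is written
```

Remove the pdf() and dev.off() lines to see the plot in the usual graphics window.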
