Glossary for Working with Data sets

This is a companion glossary to a previous post on working with large data sets. It highlights the functions and arguments most relevant to reading in and working with large data sets.

  1. Loading the data file:
    1. Set directory
    2. Read file
      • read table
        • read.table('file', header = TRUE, sep = " ")
      • read csv
        • read.csv('file')
          • works like read.table, but with defaults suited to .csv files (header = TRUE, sep = ",")
      • read excel file
        • read.xlsx('file', sheetName = 'sheet')
          • similar to read.csv but for .xlsx files
          • need to load package that can read .xlsx files
            • # install and load packages to read xlsx files
               install.packages("xlsx")
               library(xlsx)
          • need to specify tab being read in
            • ex: 
              P <- read.xlsx('Precip_Basin.xlsx', startRow = 4, header = TRUE, sheetName = 'SUP_mm')
      • Choose file
        • read.csv(file.choose())
          • to choose files manually
      • Scan
        • scan('file', what = , sep = " ")
          • scan should not be used for larger data sets
      • Additional information
      • Extensive descriptions of reading data
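
The read functions listed above can be tried out without any existing data. The following is a minimal sketch that writes two small temporary files and reads them back; the file contents and the column names (year, precip) are made up for illustration.

```r
# Hypothetical example data, written to temporary files
tmp_txt <- tempfile(fileext = ".txt")
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("year precip", "2001 750.2", "2002 812.5"), tmp_txt)
writeLines(c("year,precip", "2001,750.2", "2002,812.5"), tmp_csv)

# read.table needs the header and separator spelled out
d_txt <- read.table(tmp_txt, header = TRUE, sep = " ")

# read.csv already defaults to header = TRUE and sep = ","
d_csv <- read.csv(tmp_csv)

# scan returns a plain vector rather than a data frame;
# skip = 1 skips the header line, what = numeric() sets the element type
v <- scan(tmp_txt, what = numeric(), sep = " ", skip = 1, quiet = TRUE)
```

Both d_txt and d_csv end up as data frames with the same two columns, while v is a single numeric vector holding all four values in reading order.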
  2. Cleaning up the data set:
    1. Blank values
      • NA values
        • na.omit(data)
          • ex: 
            new_data <- na.omit(data)
          • This creates a new data set identical to the original, but with every row that contains an NA value removed
        • na.exclude(data)
          • Functions similarly to na.omit(data)
          • Both return object with rows containing NAs removed
        • na.fail(data)
          • Checks for NAs and returns the tested object if none are found; otherwise it stops with an error
        • na.pass(data)
          • Passes over NAs to return the object unchanged
      • Additional information
      • More information with examples
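
To make the differences between the NA functions concrete, here is a small sketch on a hypothetical three-row data frame (the values are invented for illustration).

```r
# Hypothetical data frame with one missing value
data <- data.frame(year = c(2001, 2002, 2003),
                   precip = c(750.2, NA, 812.5))

# na.omit drops every row that contains at least one NA
new_data <- na.omit(data)

# na.pass returns the object unchanged, NAs included
same_data <- na.pass(data)

# na.fail signals an error when NAs are present;
# tryCatch keeps the script running and records the outcome
has_na <- tryCatch({
  na.fail(data)
  FALSE
}, error = function(e) TRUE)
```

Here new_data keeps only the 2001 and 2003 rows, same_data still contains the NA, and has_na is TRUE because na.fail stopped on the missing value.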
    2. Converting factors and numerics
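
The post does not list functions for this step, so the following is only a sketch of one common base R approach, using a made-up factor. The point to watch is that calling as.numeric directly on a factor returns its internal level codes rather than the numbers shown in its labels.

```r
# Hypothetical factor whose labels are numbers stored as text
f <- factor(c("10", "20", "10", "30"))

# as.numeric(f) returns the internal level codes, not the labels
codes <- as.numeric(f)

# Convert through character first to recover the numeric values
values <- as.numeric(as.character(f))

# Going the other way: turn a numeric vector into a factor
g <- as.factor(values)
```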
  3. Using the data:
    1. Plotting
      • Basic plot function
        • plot(x,y)
      • Other useful arguments inside the plot function
        • type = " "
          • plot type
          • e.g. "l" for line plot
        • main = ' '
          • main plot title
        • xlab = ' '
          • x-axis title
        • ylab = ' '
          • y-axis title
        • col = " "
          • line color
          • e.g. "blue"
        • lwd = <num>
          • line width/thickness
      • Arguments outside plot function
        • grid()
          • turns on grid lines for plot
        • par(new = TRUE/FALSE)
          • for plotting multiple lines or data sets on one plot
          • TRUE if the next plot call should draw on top of the current plot; FALSE if it should start a fresh one
      • ex:
        # plot Annual Precipitation
        plot(year,ann,
             type = "l",
             main = 'Annual Lake Superior Basin Precipitation', 
             xlab = 'Year',
             ylab = 'Precipitation(mm)', 
             col = "blue", 
             lwd = 2)
        grid()
        par(new = FALSE)
      • Additional information
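
The example above sets par(new = FALSE), which draws a single series. As a sketch of the other case, the following overlays a second line with par(new = TRUE). The data values and the avg series are hypothetical, and the shared ylim is an assumption added here so that the two layers use the same scale.

```r
# Hypothetical data: annual totals and their mean over the same years
year <- 2001:2005
ann  <- c(750, 812, 790, 830, 805)
avg  <- rep(mean(ann), length(year))

# Fix shared axis limits so both layers line up
yl <- range(ann, avg)

plot(year, ann,
     type = "l",
     main = 'Two series on one plot',
     xlab = 'Year',
     ylab = 'Precipitation (mm)',
     col = "blue",
     lwd = 2,
     ylim = yl)
grid()

# new = TRUE tells the next plot() call not to clear the existing plot
par(new = TRUE)
plot(year, avg,
     type = "l",
     col = "red",
     lwd = 2,
     ylim = yl,
     axes = FALSE, xlab = '', ylab = '')
```

The second call suppresses its own axes and labels so that it does not print over the ones drawn by the first call.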
