This is a companion glossary for a previous post on working with large data sets. It highlights the R functions and arguments most relevant to reading, cleaning, and plotting large data sets.
- Loading the data file:
- Set directory
- Standard form
- setwd("…")
- Mac
setwd("/Users/UserName/Documents/FOLDER")
GitHub Gist: set_dir_mac.R
- Windows
setwd("C:/Users/UserName/Documents/FOLDER")
GitHub Gist: set_dir_windows.R
- Simplified
setwd("~/FOLDER")
GitHub Gist: set_dir_simplified.R
- On Windows, "~" replaces "C:/Users/UserName/Documents"; on a Mac it expands to "/Users/UserName", so the equivalent path there is "~/Documents/FOLDER"
- Additional information
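As a rough sketch of the directory calls above, the snippet below switches the working directory and then restores it. A temporary folder stands in for FOLDER so it can run on any machine; the names old_dir, target, same_dir, and home_dir are made up for illustration.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# Assumption: a temporary folder stands in for "FOLDER" from the entries above.
target <- tempdir()
setwd(target)

# normalizePath() resolves symlinks and separators so the comparison is fair.
same_dir <- normalizePath(getwd()) == normalizePath(target)
print(same_dir)

# path.expand() shows what "~" stands for on the current machine.
home_dir <- path.expand("~")
print(home_dir)

# Restore the original working directory.
setwd(old_dir)
```

Checking path.expand("~") first is a cheap way to see where the simplified form will actually point before relying on it.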
- Read file
- read table
- read.table('file', header = TRUE, sep = " ")
- Works with .txt and .csv
- ex:
Temp <- read.table('Temp.csv', header=TRUE, sep = ",")
GitHub Gist: read_table.R
- read csv
- read.csv('file')
- a wrapper around read.table with defaults suited to .csv files (header = TRUE, sep = ",")
- ex:
Temp <- read.csv('Temp.csv', header=TRUE, sep = ",")
GitHub Gist: read_csv.R
- read excel file
- read.xlsx('file', sheetName = 'sheet')
- similar to read.csv but for .xlsx files
- requires a package that can read .xlsx files, such as xlsx
# install and load the package used to read xlsx files
install.packages("xlsx")
library(xlsx)
GitHub Gist: xlsx.R
- the sheet (tab) to read must be specified
- ex:
P <- read.xlsx('Precip_Basin.xlsx', sheetName = 'SUP_mm', startRow = 4, header = TRUE)
GitHub Gist: read_xlsx.R
- Choose file
- read.csv(file.choose())
- opens a file dialog so the file can be chosen manually
- Scan
- scan('file', what = , sep = " ")
- scan() should not be used for larger data sets
- Additional information
- Extensive descriptions of reading data
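To make the reading functions concrete, here is a minimal sketch that writes a small CSV to a temporary file and reads it back three ways. The file contents and the names csv_path, via_table, via_csv, and temps are invented for illustration and are not the data from the original post.

```r
# Build a tiny CSV in a temporary file (made-up values, not the post's data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Year,Temp", "2001,4.5", "2002,5.1", "2003,4.8"), csv_path)

# read.table() needs the header and the separator spelled out.
via_table <- read.table(csv_path, header = TRUE, sep = ",")

# read.csv() already defaults to header = TRUE and sep = ",".
via_csv <- read.csv(csv_path)

# Compare the two results; they should describe the same data frame.
print(all.equal(via_table, via_csv))

# scan() returns a list of vectors rather than a data frame.
# 'what' gives one template value per column; skip = 1 jumps over the header.
temps <- scan(csv_path, what = list(Year = 0, Temp = 0),
              sep = ",", skip = 1, quiet = TRUE)
print(temps$Temp)
```

For delimited files with a header, read.csv is usually the shortest call; read.table is the more general form when the separator is something other than a comma.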
- Cleaning up the data set:
- Blank values
- NA values
- na.omit(data)
- ex:
new_data <- na.omit(data)
GitHub Gist: na_omit.R
- This creates a new data set identical to the original, minus every row that contains an NA value
- na.exclude(data)
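- Functions similarly to na.omit(data)
- Both return the object with rows containing NAs removed; they differ in that residuals and predictions from a model fitted with na.exclude are padded with NA back to the original length
- na.fail(data)
- Checks for NAs and returns the tested object if none are found; otherwise it signals an error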
- na.pass(data)
- Passes over NAs to return the object unchanged
- Additional information
- More information with examples
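The four NA helpers can be compared side by side on a toy data frame. This is only a sketch: the year and precip values are made up, and the linear model is there solely to show where na.omit and na.exclude differ.

```r
# Toy data with one missing value (made-up numbers).
data <- data.frame(year   = c(2001, 2002, 2003, 2004),
                   precip = c(710, NA, 655, 690))

# na.omit() drops every row that contains an NA.
new_data <- na.omit(data)
print(nrow(new_data))

# na.pass() hands the object back unchanged, NAs included.
same <- na.pass(data)

# na.fail() signals an error when any NA is present.
failed <- tryCatch({ na.fail(data); FALSE }, error = function(e) TRUE)
print(failed)

# The difference between na.omit and na.exclude shows up in model output:
# na.exclude pads residuals with NA so they line up with the original rows.
fit_omit <- lm(precip ~ year, data = data, na.action = na.omit)
fit_excl <- lm(precip ~ year, data = data, na.action = na.exclude)
print(length(residuals(fit_omit)))
print(length(residuals(fit_excl)))
```

When residuals need to be attached back onto the original data frame, na.exclude tends to be the more convenient choice because the lengths already match.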
- Converting factors and numerics
- as.character(data)
- converts all factors to character strings
- ex:
P[,1]<-as.character(P[,1])
GitHub Gist: character.R
- Additional information
- as.numeric(data)
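- converts characters into a numeric vector; applied directly to a factor it returns the internal level codes, so convert the factor with as.character first
- same function as as.double (as.real was an older alias and has been defunct since R 3.0.0)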
- ex:
P[,1]<-as.numeric(P[,1])
GitHub Gist: numeric.R
- Additional information
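One detail worth seeing in action is what as.numeric does to a factor. The sketch below uses a made-up factor f; the names codes, values, and mixed are illustrative only.

```r
# A factor whose labels look like numbers (made-up values).
f <- factor(c("10", "20", "10", "30"))

# Calling as.numeric() directly returns the internal level codes.
codes <- as.numeric(f)
print(codes)    # 1 2 1 3

# Converting to character first recovers the values themselves.
values <- as.numeric(as.character(f))
print(values)   # 10 20 10 30

# Strings that are not numbers become NA (with a warning, suppressed here).
mixed <- suppressWarnings(as.numeric(c("1.5", "n/a", "3")))
print(mixed)
```

This is why the order of the two examples above matters: converting the column with as.character before as.numeric keeps the original values.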
- Using the data:
- plotting
- Basic plot function
- plot(x,y)
- Other useful arguments inside the plot function
- type = " "
- plot type
- e.g. "l" for a line plot
- main = ' '
- main plot title
- xlab = ' '
- x-axis title
- ylab = ' '
- y-axis title
- col = " "
- line color
- e.g. "blue"
- lwd = <num>
- line width/thickness
- Arguments outside plot function
- grid()
- adds grid lines to the current plot
- par(new = T/F)
- for plotting multiple lines or data sets on one plot
- set to TRUE before the next plot() call to draw it over the existing plot; FALSE (the default) starts a fresh plot
- ex:
# plot Annual Precipitation
plot(year, ann, type = "l", main = 'Annual Lake Superior Basin Precipitation', xlab = 'Year', ylab = 'Precipitation (mm)', col = "blue", lwd = 2)
grid()
par(new = FALSE)
GitHub Gist: plot.R
- Additional information
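To show how par(new = TRUE) overlays a second series, here is a small sketch with made-up numbers. The vectors year, ann, and ann2 and the file out_file are hypothetical; the plot is written to a temporary PDF so the code also runs without a display.

```r
# Made-up yearly values for two series (not the post's precipitation data).
year <- 2001:2010
ann  <- c(780, 810, 760, 830, 795, 770, 820, 805, 790, 815)
ann2 <- c(700, 735, 690, 750, 720, 705, 745, 730, 715, 740)

# Use one shared y-range so both series are drawn on the same scale.
y_range <- range(ann, ann2)

# Draw into a temporary PDF file.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# First series, with titles and axis labels.
plot(year, ann, type = "l", main = "Two series on one plot",
     xlab = "Year", ylab = "Precipitation (mm)",
     col = "blue", lwd = 2, ylim = y_range)
grid()

# par(new = TRUE) keeps the existing plot so the next plot() draws on top.
par(new = TRUE)
plot(year, ann2, type = "l", col = "red", lwd = 2, ylim = y_range,
     axes = FALSE, xlab = "", ylab = "")

# Close the device to finish writing the file.
invisible(dev.off())
```

Suppressing the axes and labels on the second call avoids drawing them twice. When both series share the same axes, lines(year, ann2, ...) is a simpler alternative to par(new = TRUE).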