Sunday, August 4, 2013

A quick tutorial in R

I've had a request for a quick tutorial on how to get started using R.  R is a statistical programming language: you analyze data and store the results in named "objects" in your R workspace.  You can then use those objects to create new analyses and objects.  Note: Wherever you see a "#", what follows is a comment.  You can delete that comment before running the code or leave it in, because R ignores anything that follows a # sign on that line.  Here's how I get my students started.


First, get your data organized in a spreadsheet program.  Here's a sample of how to do it, set up for ANOVA.

Treatment    Data
1            10
1            12
1            8
2            4
2            13
2            14
3            9
3            14
3            19

Each individual measurement or observation gets its own row, and each variable gets its own column: one column for the measurements, plus one column for each independent variable (Treatment, Sex, etc.).  Then export your data as a csv file or as a tab-delimited text file (I personally prefer csv files).
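If you'd rather build the file from inside R than from a spreadsheet, the layout above can be sketched as follows.  The object name anova.data and the file name sample_anova.csv are made up for this illustration, and the file is written to R's temporary folder so nothing on your machine gets overwritten.

```r
# The sample data from the table above: one observation per row,
# one column per variable
anova.data <- data.frame(
  Treatment = rep(1:3, each = 3),
  Data      = c(10, 12, 8, 4, 13, 14, 9, 14, 19)
)

# Write the data to a csv file (hypothetical file name, temporary folder)
csv.path <- file.path(tempdir(), "sample_anova.csv")
write.csv(anova.data, csv.path, row.names = FALSE)

# Read it back in to confirm that the file has the expected layout
check <- read.csv(csv.path, header = TRUE)
print(check)
```

Setting row.names = FALSE keeps R from adding an extra column of row numbers to the file.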

Second, download the proper version of R from the R Project website and install it.  There are a variety of versions available, so you should be able to find one for your machine.  After opening R, you'll get the standard start screen.


The first task is to set the working directory so R will know where your csv file is saved.  For Macs, all you need to do is go to Misc >> Change working directory, then find the folder your csv document is in and click "Open."  For Windows, go to File >> Change dir.  Or if you prefer the text command, which works on all computers, it's
setwd("path/to/your/folder") #replace with the path to the folder holding your csv file; keep the quotes
Once you set the working directory, type in
Object.Name = read.csv("name of your file.csv", header=TRUE)
and hit return.  Note the "." between "Object" and "Name".  R does not like spaces in object names.  This will create a new object filled with your data in your R workspace.  Use the "summary" command to get some basic statistics:
summary(Object.Name)
This will give you the minimum, 1st quartile, median, mean, 3rd quartile, maximum, and the number of missing values (if any).
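One thing worth checking at this point is how R has stored each column.  The sketch below types the sample data in directly instead of reading a file, so it runs on its own.  Because the Treatment codes are numbers, R reads them as a continuous variable; converting the column to a factor tells R to treat them as group labels, which is what an ANOVA needs.

```r
# Sample data entered directly (stands in for the result of read.csv)
Object.Name <- data.frame(
  Treatment = rep(1:3, each = 3),
  Data      = c(10, 12, 8, 4, 13, 14, 9, 14, 19)
)

str(Object.Name)      # shows Treatment stored as integers

# Convert the numeric codes to a categorical variable (factor)
Object.Name$Treatment <- factor(Object.Name$Treatment)

summary(Object.Name)  # Treatment now shows a count per group
```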

The basic linear analysis is
Analysis.name = lm(Data~Treatment, data = Object.Name)
summary(Analysis.name)
The "summary" command will give you the coefficients, standard errors, t-statistics, p-values, R-squared and adjusted R-squared values, F-statistic, and overall p-value for the analysis.
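If you want individual numbers out of that summary rather than the whole printout, you can save the summary as its own object and pull pieces from it.  This is a minimal sketch using the sample data typed in directly; the name fit.summary is just an illustration.  Note that Treatment is left as a number here, so lm fits a single straight line through the group codes.

```r
# Sample data entered directly so the example runs on its own
Object.Name <- data.frame(
  Treatment = rep(1:3, each = 3),
  Data      = c(10, 12, 8, 4, 13, 14, 9, 14, 19)
)

Analysis.name <- lm(Data ~ Treatment, data = Object.Name)
fit.summary   <- summary(Analysis.name)

coef(fit.summary)        # table of estimates, standard errors, t values, p-values
fit.summary$r.squared    # R-squared
fit.summary$fstatistic   # F-statistic with its degrees of freedom
```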

You can get diagnostic plots for your analysis by typing the following two commands:
layout(matrix(c(1,2,3,4),2,2)) #create a 2 x 2 plot
plot(Analysis.name)
And to graph the results of your analysis, type the following:
layout(1) #go back to one plot per window after the 2 x 2 diagnostic plots
plot(Data~Treatment, data = Object.Name, xlab="x-axis label", ylab = "y-axis label", main = "Plot title")
abline(Analysis.name, col="red", lwd = 2) #add the regression line to the graph IF your independent variable is continuous.  Otherwise, skip this step.
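If you're running linear regression between one dependent and one independent variable, then that's all you need.  If you're running an ANOVA, Treatment must be stored as a factor rather than as a number (otherwise R fits a single slope through the group codes 1, 2, 3), so convert it and refit the model before asking for the ANOVA table:
Object.Name$Treatment = factor(Object.Name$Treatment)
Analysis.name = lm(Data~Treatment, data = Object.Name)
Object.anova = anova(Analysis.name)
Object.anova #typing the object's name prints the ANOVA table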
The alternate way to do ANOVA in R is the aov command
Object.aov = aov(Data~Treatment, data = Object.Name)
summary(Object.aov)
No ANOVA is complete without a post-hoc test.  For the Tukey HSD test, the command is
TukeyHSD(Object.aov)
Note: The TukeyHSD test only works with objects from the aov command, not the lm command, and Treatment must be a factor.  Also, if you only have two Treatment groups, the ANOVA or AOV command will give the same p-value as a Student's t-test that assumes equal variances (the F-statistic is the square of the t-statistic).  The two tests share the same assumptions: independent observations, equal variances among groups, and normally distributed residuals.  Note that it is the residuals, not the raw data, that need to be normally distributed.
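Here is a small self-contained sketch of the ANOVA steps using the sample data, with Treatment entered as a factor from the start.  The second half drops group 3 to show that a two-group ANOVA and an equal-variance t-test agree; the names two.groups, f.value and t.value are made up for the illustration.

```r
# Sample data with Treatment stored as a factor (group labels)
Object.Name <- data.frame(
  Treatment = factor(rep(1:3, each = 3)),
  Data      = c(10, 12, 8, 4, 13, 14, 9, 14, 19)
)

# One-way ANOVA followed by the Tukey HSD post-hoc test
Object.aov <- aov(Data ~ Treatment, data = Object.Name)
summary(Object.aov)
tukey <- TukeyHSD(Object.aov)
tukey   # one row per pair of groups

# With only two groups, ANOVA matches the equal-variance t-test
two.groups <- droplevels(subset(Object.Name, Treatment != "3"))
f.value <- anova(lm(Data ~ Treatment, data = two.groups))[["F value"]][1]
t.value <- unname(t.test(Data ~ Treatment, data = two.groups,
                         var.equal = TRUE)$statistic)
all.equal(f.value, t.value^2)   # F equals t squared
```

Leaving out var.equal = TRUE makes t.test run Welch's version, which does not assume equal variances and will generally not match the ANOVA exactly.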

Multiple Regression and ANOVA

The command for multiple linear regression or ANOVA with multiple independent variables is similar.

No interactions between independent variables:
Analysis.name = lm(Y ~ X1 + X2, data = Object.Name)
summary(Analysis.name)
With interactions between independent variables:
Analysis.name = lm(Y ~ X1 * X2, data = Object.Name)
summary(Analysis.name)
To get the ANOVA table:
Analysis.anova = anova(Analysis.name)
Analysis.anova #typing the object's name prints the ANOVA table
Diagnostic plots and graphing the results of your analysis are the same as in the simple linear analysis.
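To see the two model formulas side by side, here is a sketch on simulated data.  Everything in it (the data frame sim, the variable values, the object names additive and with.int) is invented for the illustration and is not from a real study.  In R's formula notation, X1 * X2 is shorthand for X1 + X2 + X1:X2, where X1:X2 is the interaction term.

```r
set.seed(1)   # makes the simulated numbers repeatable

# Simulated data: one continuous predictor, one two-level factor
n   <- 40
sim <- data.frame(
  X1 = runif(n, 0, 10),
  X2 = factor(rep(c("A", "B"), each = n / 2))
)
sim$Y <- 2 + 0.5 * sim$X1 + 1.5 * (sim$X2 == "B") + rnorm(n)

additive <- lm(Y ~ X1 + X2, data = sim)   # no interaction
with.int <- lm(Y ~ X1 * X2, data = sim)   # main effects plus interaction

anova(with.int)             # ANOVA table: X1, X2, X1:X2, Residuals
anova(additive, with.int)   # F-test: does adding the interaction improve the fit?
```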

Time Series

R has multiple ways of dealing with time series like global temperature datasets.  The most basic analysis possible is done using the "lm" command.  For instance, say you're interested in the linear trend in monthly UAH temperature data since 1998.  Assuming that your Time variable is already in decimal-year format (for example, 1998.5 for the middle of 1998), the commands are
UAH.lm = lm(UAH~Time, data = Object.Name, subset = Time >= 1998)
summary(UAH.lm)
layout(matrix(c(1,2,3,4),2,2))
plot(UAH.lm) #diagnostic plots
layout(1) #go back to one plot per window
plot(UAH~Time, data = Object.Name, subset = Time >= 1998, type = "l", lwd = 1, col = "red", lty = 1, xlab="Time (years)", ylab = "Temperature anomalies (°C)", main = "UAH global temperature data since 1998") #time series plot
abline(UAH.lm, col = "red", lwd = 2)
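If your data has separate year and month columns, you'll need to build the decimal-year Time variable yourself.  The sketch below shows one way to do that and then fits the trend.  The temperature values are random numbers generated for the illustration, NOT real UAH data, and the data frame name temps is made up.

```r
set.seed(42)   # makes the made-up numbers repeatable

# Monthly records for 1990 through 2012
temps <- data.frame(
  Year  = rep(1990:2012, each = 12),
  Month = rep(1:12, times = 23)
)

# Decimal year: January 1998 becomes 1998.000, July 1998 becomes 1998.500
temps$Time <- temps$Year + (temps$Month - 1) / 12

# Made-up anomalies: a small upward trend plus noise (NOT real UAH values)
temps$UAH <- 0.015 * (temps$Time - 1990) + rnorm(nrow(temps), sd = 0.15)

# Linear trend since 1998
UAH.lm <- lm(UAH ~ Time, data = temps, subset = Time >= 1998)
coef(UAH.lm)[["Time"]] * 10    # slope converted to degrees C per decade
confint(UAH.lm)["Time", ]      # 95% confidence interval for the yearly slope
```

The slope from lm is in degrees per year because Time is measured in years, so multiplying by 10 gives the trend per decade.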

More resources for R

Now, I can hear those of you who are already R fanatics yelling about all that I've left out.  There's a reason for that: This article is only intended to help someone with no idea of how to use R get started.  There are already multiple websites available for learning much more about R and I'm not about to try to create yet another such site.  Instead, here are a couple of sites I recommend if you want to learn far more about R:

Quick-R, a great website for learning R.

R Tutorial, which teaches basic and advanced statistical techniques using R.

Cookbook for R, which gives solutions for a wide variety of R related problems.
