Centering and scaling predictors
Centering and scaling by hand seems to be no longer necessary, now that jtools has a function called “summ” that replaces the old-school “summary” call. Its output options include centered or scaled predictors, confidence intervals, and robust standard errors.
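As a minimal sketch (the model and data here are placeholders; the argument names come from jtools’ documentation, so double-check them against your installed version):

```r
# install.packages("jtools")  # if not already installed
library(jtools)

# fit an ordinary regression model (mtcars is just a stand-in dataset)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# summ() replaces summary(): scaled predictors, confidence
# intervals, and robust (HC3) standard errors in one call
summ(fit, scale = TRUE, confint = TRUE, robust = "HC3")
```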
Reverse coding continuous variables
This is the ad skepticism scale. For some reason, the most skeptical people score a 1 out of 7. That doesn’t make sense to me, so I’m going to reverse code all of the items so that the most skeptical person scores a 7. All you have to do is subtract 8 from each score and take the absolute value, which flips the negative result back to positive: a 1 becomes a 7, and a 7 becomes a 1.
mydata <- mydata %>%
  filter(skep9 >= 1) %>%   # filter out incompletes
  mutate_at(vars(skep1, skep2, skep3, skep4, skep5,
                 skep6, skep7, skep8, skep9),
            ~ abs(. - 8))
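A quick way to convince yourself the arithmetic works is to run it on a toy vector (these values are made up, not from the dataset):

```r
# endpoints flip, the midpoint stays put
scores <- c(1, 4, 7)
abs(scores - 8)  # 7 4 1
```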
Coding and recoding categorical variables
For categorical variables, it can get confusing. Take gender, for example: was 1 male, or was 1 female? I can’t keep that straight, so I will create a variable called “female” and set it to 0 for male, 1 for female, and NA for other. That way it is impossible to mix up. Sometimes, though, you have a dataset whose variables are not optimally coded for your analysis, so you have to change “male” to 0 or 0 to “male.” Here is how to go from gender = male/female/other to female = 1/0/NA using dplyr.
mydata <- mydata %>%
  mutate(female = recode(gender, "male" = 0, "female" = 1, "other" = NA_real_))
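After recoding, a quick cross-tab of the old and new variables confirms the mapping (count() is dplyr’s shortcut for group_by() plus tally()):

```r
# each gender value should line up with exactly one female value
mydata %>% count(gender, female)
```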
Reordering categorical variables
The lm function in R will automatically dummy code categorical variables, but it sets the order of the factor to be alphabetical. This is not always ideal. Using the code below, you can set the factors in the order you want. The baseline or control should be listed first. These levels should exactly match current values of the variable.
mydata <- tibble(myfactor = sample(c("Baseline", "Treatment"), 25, replace = TRUE))

mydata <- mydata %>%
  mutate(myfactor = factor(myfactor, levels = c("Baseline", "Treatment")))
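To confirm the ordering took, inspect the levels; the first one listed is what lm() will treat as the reference category:

```r
levels(mydata$myfactor)  # "Baseline" "Treatment"
```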
Aggregating scale items
Sometimes you have a scale, such as the ad skepticism scale mentioned earlier, that needs to be aggregated for analysis. This typically means averaging across all nine items for each participant. The first step is to remove all of the incomplete observations in your data, if you haven’t already done that.
The most straightforward way is to simply add them all up and divide by the total:
mydata <- mydata %>%
  mutate(skepticism = (skep1 + skep2 + skep3 + skep4 + skep5 +
                       skep6 + skep7 + skep8 + skep9) / 9)
You can also select all of the columns at once by matching on the “skep” part of the variable names and apply the rowMeans function. Note that rowMeans will throw an error if any of the columns are not numeric, and it will return NA for any participant with missing data.
mydata <- mydata %>%
  mutate(skepticism = rowMeans(select(., matches("skep"))))
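If you would rather average over whatever items a participant did complete instead of dropping them, rowMeans takes an na.rm argument (whether that is appropriate depends on your scale):

```r
# average only the non-missing items for each row
mydata <- mydata %>%
  mutate(skepticism = rowMeans(select(., matches("skep")), na.rm = TRUE))
```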
Make sure your data is in the right format and remove missing values.
# check the structure of your variable
str(mydata$skep1)

# convert all to numeric except for the name column
mydata <- mydata %>% mutate_at(vars(-name), ~ as.numeric(.))

# remove anyone who hasn't completed the last item
mydata <- mydata %>% filter(skep9 >= 1)
Checking your work
I am often transforming or recoding variables, and I need to check that it worked. The easiest way is to grab a sample of about 20 rows that includes both the new and old variables so I can compare them side by side. To do this, I just select the columns I want, then take a sample using dplyr.
mydata %>%
  select(oldvar1, oldvar2, newvar1, newvar2) %>%
  sample_n(20)
Another cool trick I’ve learned is that you can change the default number of rows output by a tibble. I often want to get more than 10 rows so that I can check my data:
options(tibble.print_max = 30, tibble.print_min = 30)
It’s really important to check your data and be sure it’s right prior to conducting any analysis, and this is a good way to do it.