This brief guide will help you obtain a dataset for Statistics 101. On the tabbed pages, you can download subsets of the General Social Survey and the American National Election Studies.
You may also visit the searchable Duke Data and Visualization Services collection page, which contains more than 450 datasets that cover a wide variety of topics.
Most data sources contain a mixture of numerical and categorical variables, but may not explicitly state this on the web page or the documentation. We advise looking at the documentation or summary statistics for the variables in your chosen dataset to make this determination. The following three tabs present three data collections that have been popular in Stats 101.
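Once a dataset is loaded in R, one quick way to make this determination is to look at each column's class and summary. The following is only a sketch: the data frame toy and its values are made up for illustration and do not come from any of the course datasets.

```r
# Hypothetical toy data; the real datasets have many more rows and columns.
toy <- data.frame(
  age    = c(23, 45, 31, 60),
  educ   = c(12, 16, 14, 12),
  region = c("south", "west", "south", "east"),
  stringsAsFactors = TRUE
)

# Class of each column: "numeric" or "integer" suggests a numerical variable,
# "factor" or "character" suggests a categorical one.
col_classes <- sapply(toy, class)
print(col_classes)

# summary() reports quartiles for numeric columns and counts for factors.
print(summary(toy))
```

Keep in mind that a categorical variable stored as number codes will still show up as numeric here, so the codebook remains the final authority.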
The Brandaleone Family Lab for Data and Visualization Services (226 Perkins) maintains software, and has staff, that can assist in moving data into R. Ease of access to the data in R should not be the primary factor in choosing a data source, but it should be a factor in selecting variables for your project. Once more, we are happy to advise!
Importing Data and Extracting Sets of Variables
Identify Variables of Interest
Each of the datasets contains a codebook in R Markdown format. Once knit, this document provides information for each of the variables in the dataset.
Peruse the codebook for variables of interest, and be certain to check how many observations are not missing for each variable. This will help ensure that you have a large number of observations that are not missing on any of your chosen variables.
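After the data are in R, the same check can be repeated directly on the data frame. The sketch below assumes missing values are stored as NA and uses a small made-up data frame (toy) in place of the real data.

```r
# Hypothetical toy data with some missing values (NA).
toy <- data.frame(
  age  = c(23, NA, 31, 60, 47),
  educ = c(12, 16, NA, NA, 14),
  sex  = c("f", "m", "f", "m", NA)
)

# Number of non-missing observations for each variable.
n_present <- colSums(!is.na(toy))
print(n_present)

# Number of rows with no missing value on any of the chosen variables.
n_complete <- sum(complete.cases(toy[c("age", "educ")]))
print(n_complete)
```

The second count is the one that matters for an analysis using several variables at once, because a row is usually dropped if any one of them is missing.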
Once identified, be certain to write down the variable names, which are blue and linked in the knit codebook.
Download the Data
If you have not done so, download the data file (a ZIP archive), and unzip it into a directory of your choosing. This directory is ideally set aside only for your R projects.
The data will be a “.csv” file. CSV files are text files where variables are separated by commas.
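To make the format concrete, the sketch below parses a few lines of CSV text typed directly into R. The values are invented; only the layout (a header row of variable names, then one row per observation) mirrors what a typical CSV file looks like.

```r
# A made-up CSV: the first line holds variable names, later lines hold values.
csv_text <- "caseid,year,age,educ
1,2010,34,12
2,2010,58,16
3,2012,41,14"

# read.csv() can parse text directly through its 'text' argument.
toy <- read.csv(text = csv_text)

print(names(toy))  # variable names taken from the header row
print(dim(toy))    # number of rows, then number of columns
```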
Import the Data into R
Once the data are unzipped, open RStudio. If you are using the web-based version of RStudio, you will first need to upload the data (bottom right pane of RStudio). Once done, the top right pane will allow you to import the data into R.
After selecting the correct file, click next. Be certain to provide a name for the dataset (this is a temporary step, because you will only want a portion of the data). Also, check “Yes” for the Heading option.
If you are running the desktop version, the following two commands set a working directory and import the data, which are located in that working directory.
setwd("C:/Documents/R/")
gss <- read.csv("gss_sda_nomissing.csv")
Finally, the following command will open the data in a viewer. It's a good idea to check that the data imported correctly.
View(gss)
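The import and the check can also be sketched end to end without touching the working directory. In the example below, the file example.csv and its contents are stand-ins written to a temporary folder, so the code runs anywhere; with the real data you would point read.csv() at your own unzipped file instead.

```r
# Write a tiny stand-in CSV to a temporary folder (hypothetical data).
csv_path <- file.path(tempdir(), "example.csv")
write.csv(
  data.frame(caseid = 1:3, year = c(2010, 2010, 2012), age = c(34, 58, 41)),
  csv_path,
  row.names = FALSE
)

# Import by full path; this avoids changing the working directory.
example <- read.csv(csv_path)

# Checks that work with or without the RStudio viewer.
print(dim(example))   # rows and columns
str(example)          # type of each variable
print(head(example))  # first few rows

# View() opens the spreadsheet-style viewer; it needs an interactive session.
if (interactive()) View(example)
```

If the number of rows or the variable types look wrong, the usual suspects are a missing header row or the wrong file.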
As you will want only a few variables, you may want to create a subset that contains only those variables. This is easy to do in R.
In this example, I have imported the General Social Survey as a new data frame, which I have called gss. I have also identified four variables that I wish to keep: caseid, year, age, and educ.
The following command will create a new object, called data, which will be composed of my four variables, all of which are located in the gss data frame.
data <- gss[c("caseid", "year", "age", "educ")]
The basic syntax is as follows:
new_name <- old_name[c("variable1", "variable2", "variable3")]
If successful, you will have a new data frame that contains the same number of observations, but fewer variables. From this point, the new data frame will be the one to use.
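A quick way to confirm that the subset worked is to compare the dimensions of the two data frames. The sketch below uses a small invented data frame (gss_toy) rather than the real General Social Survey file; the object names gss_toy, keep, and data_toy are illustrative only.

```r
# Hypothetical stand-in for the imported data frame.
gss_toy <- data.frame(
  caseid = 1:4,
  year   = c(2010, 2010, 2012, 2012),
  age    = c(34, 58, 41, 27),
  educ   = c(12, 16, 14, 13),
  sex    = c("f", "m", "f", "m")
)

# Variables to keep; stop early if any name is misspelled or absent.
keep <- c("caseid", "year", "age", "educ")
stopifnot(all(keep %in% names(gss_toy)))

# Same subsetting syntax as above.
data_toy <- gss_toy[keep]

# Same number of observations, fewer variables.
print(dim(gss_toy))
print(dim(data_toy))
```

Checking the names first is a small safeguard, since a misspelled variable name otherwise produces an "undefined columns selected" error.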
A Short Guide to Codebooks
A dataset is typically accompanied by a codebook. Codebooks provide documentation of the dataset contents. Codebooks do not usually contain more detailed information regarding the survey or the sample.
It is important to check the codebook for the variables of interest. It will tell you
- what each code represents, and
- the number of cases that are present or missing.
Sometimes, codebooks will provide the exact phrasing of the question, but please refer to the survey documentation for question phrasing and sample selection procedures.
Codebooks usually contain
- The variable name, by which one refers to a variable in a statistics program.
- The variable label, which describes the contents of the variable in a short form.
- The variable data type, which is typically text-based (string variable) or a number (numeric variable).
In addition, if the data are categorical, the codebook contains
- A list of each unique value that is contained by the variable, typically accompanied by the frequency count of each value.
If the data are continuous, the codebook contains
- A range of values contained by the variable, along with the minimum and maximum values.
Finally, the codebook will often provide a count of missing cases and the missing value codes employed in the dataset.
The variables in the datasets selected for this class generally contain few missing codes, and all missing values have been corrected for R.
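The same kinds of summaries a codebook reports can be reproduced in R, which is a useful cross-check against the codebook. The sketch below assumes missing values are stored as NA and uses an invented data frame (toy) with one categorical and one continuous variable.

```r
# Hypothetical toy data: one categorical and one continuous variable.
toy <- data.frame(
  marital = c("married", "never married", "married", NA, "divorced", "married"),
  age     = c(23, 45, NA, 60, 31, 52)
)

# Categorical: frequency count of each unique value, including missing.
freq <- table(toy$marital, useNA = "ifany")
print(freq)

# Continuous: minimum and maximum, ignoring missing values.
age_range <- range(toy$age, na.rm = TRUE)
print(age_range)

# Count of missing cases for a variable.
n_missing_age <- sum(is.na(toy$age))
print(n_missing_age)
```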
Example - General Social Survey
The following codebook entry comes from the General Social Survey and does not contain a frequency for each answer. Descriptions are italicized.
INCOME06 - variable name
TOTAL FAMILY INCOME - variable label
In which of these groups did your total family income, from all sources, fall last year before taxes, that is. - longer form of the interview question
- list of each unique value contained by this variable
| % | N | Value | Label |
|---|---|---|---|
| 1.4 | 129 | 1 | UNDER $1 000 |
| 1.2 | 112 | 2 | $1 000 TO 2 999 |
| 0.9 | 87 | 3 | $3 000 TO 3 999 |
| 0.6 | 54 | 4 | $4 000 TO 4 999 |
| 0.9 | 83 | 5 | $5 000 TO 5 999 |
| 1.1 | 105 | 6 | $6 000 TO 6 999 |
| 1.4 | 128 | 7 | $7 000 TO 7 999 |
| 2.1 | 192 | 8 | $8 000 TO 9 999 |
| 3.9 | 362 | 9 | $10000 TO 12499 |
| 3.4 | 317 | 10 | $12500 TO 14999 |
| 3.3 | 308 | 11 | $15000 TO 17499 |
| 2.5 | 231 | 12 | $17500 TO 19999 |
| 3.8 | 348 | 13 | $20000 TO 22499 |
| 3.6 | 333 | 14 | $22500 TO 24999 |
| 5.2 | 481 | 15 | $25000 TO 29999 |
| 5.6 | 518 | 16 | $30000 TO 34999 |
| 5.4 | 493 | 17 | $35000 TO 39999 |
| 9.1 | 836 | 18 | $40000 TO 49999 |
| 8.0 | 734 | 19 | $50000 TO 59999 |
| 9.7 | 891 | 20 | $60000 TO 74999 |
| 7.5 | 693 | 21 | $75000 TO $89999 |
| 6.1 | 564 | 22 | $90000 TO $109999 |
| 4.1 | 379 | 23 | $110000 TO $129999 |
| 2.6 | 237 | 24 | $130000 TO $149999 |
| 6.5 | 595 | 25 | $150000 OR OVER |
Data type: numeric - variable data type
Missing-data codes: 0,26-99 - list of codes used for missing values (0 and 26 through 99)
Properties
Data type: numeric
Missing-data codes: 0,26-99
Mean: 16.61
Std Dev: 5.74
Record/columns: 1/240-241
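To show how the missing-data codes and value labels in this entry would be used, here is a sketch that applies them to a handful of invented raw codes. The files prepared for this class already have missing values handled, so this step is only needed when working from raw data; the vector income06_raw and the partial label lookup are illustrative assumptions.

```r
# Hypothetical raw codes, as they might appear before any cleaning.
income06_raw <- c(18, 25, 0, 9, 98, 20, 26, 1)

# Missing-data codes from the codebook entry: 0 and 26 through 99.
missing_codes <- c(0, 26:99)

# Replace missing-data codes with NA so R treats them as missing.
income06 <- ifelse(income06_raw %in% missing_codes, NA, income06_raw)
print(income06)

# Number of valid (non-missing) responses.
n_valid <- sum(!is.na(income06))
print(n_valid)

# A few value labels copied from the codebook table (not the full list).
labels <- c(
  "1"  = "UNDER $1 000",
  "9"  = "$10000 TO 12499",
  "18" = "$40000 TO 49999",
  "20" = "$60000 TO 74999",
  "25" = "$150000 OR OVER"
)

# Look up the label for each code; missing codes stay NA.
income06_label <- unname(labels[as.character(income06)])
print(income06_label)
```

Because the values are category codes, a mean such as the 16.61 reported above describes the average code, not an average dollar amount.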