Statistics 101: Introduction

Selected Resources for students in Statistics 101


This brief guide will help you to obtain a dataset for Statistics 101. On the tabbed pages, you can download subsets of the General Social Survey and the American National Election Studies.

You may also visit the searchable Duke Data and Visualization Services collection page, which contains more than 450 datasets that cover a wide variety of topics.

Most data sources contain a mixture of numerical and categorical variables, but the web page or documentation may not state this explicitly. We advise looking at the documentation or summary statistics for the variables in your chosen dataset to determine which type each variable is. The following three tabs present three data collections that have been popular in Stats 101.

The Brandaleone Family Lab for Data and Visualization Services (226 Perkins) maintains software, and has staff, that can assist in moving data into R. Ease of access to the data in R should not be the primary factor in choosing a data source, but it should be a factor in selecting variables for your project. Once more, we are happy to advise!

Importing Data and Extracting Sets of Variables

Identify Variables of Interest

Each of the datasets contains a codebook in R Markdown format. Once knit, this document provides information for each of the variables in the dataset.

Peruse the codebook for variables of interest, and be certain to check how many observations are not missing for each variable. This will help ensure that you have a large number of observations that are non-missing on all of the variables you choose.
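Once the data are in R, the same check can be repeated there. The following is a minimal sketch, not part of the codebook itself: the small data frame is made up for illustration and only borrows variable names used later in this guide.

```r
# Toy data frame standing in for an imported dataset (values are made up).
toy <- data.frame(
  caseid = 1:6,
  age    = c(34, NA, 51, 27, NA, 45),
  educ   = c(12, 16, NA, 14, 12, 18)
)

# Number of non-missing observations for each variable.
n_present <- colSums(!is.na(toy))
print(n_present)

# Number of rows that are non-missing on every variable of interest.
n_complete <- sum(complete.cases(toy[c("age", "educ")]))
print(n_complete)
```

The second count is the one that matters for analyses that use several variables at once, since a row is dropped if any one of them is missing.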

Once identified, be certain to write down the variable names, which are blue and linked in the knit codebook.

Download the Data

If you have not done so, download the data file (a ZIP archive), and unzip it into a directory of your choosing. This directory is ideally set aside only for your R projects.

The data will be a “.csv” file. CSV files are text files in which each line is one observation and the values of the variables are separated by commas.
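To make the format concrete, here is a small sketch that parses a few lines of CSV text. The values are made up; only the layout (a header line of variable names, then one line per observation) is the point.

```r
# A made-up CSV: the first line holds the variable names,
# and each later line holds one observation.
csv_text <- "caseid,year,age,educ
1,2010,34,12
2,2010,51,16"

# read.csv() can parse a character string directly through its text argument.
toy <- read.csv(text = csv_text)
print(toy)
```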

Import the Data into R

Once the data are unzipped, open RStudio. If you are using a web-based version of RStudio, you will first need to upload the data (bottom right pane of RStudio). Once done, the top right pane will allow you to import the data into R.

After selecting the correct file, click next. Be certain to provide a name for the dataset (this is a temporary step, because you will only want a portion of the data). Also, check “Yes” for the Heading option.

If you are running the desktop version, the following two commands set a working directory and import the data, which are located in that working directory.

gss <- read.csv("gss_sda_nomissing.csv")
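Only the import command appears above, so here is a hedged sketch of both steps together. The directory is an assumption: a temporary folder and a stand-in file with made-up values are used so that the sketch runs anywhere. In practice, point setwd() at the folder where you unzipped the data.

```r
# Hypothetical project folder; replace with the directory that holds your data.
project_dir <- tempdir()

# Stand-in file with made-up values, only so read.csv() has something to read.
writeLines(
  c("caseid,year,age,educ", "1,2010,34,12", "2,2010,51,16"),
  file.path(project_dir, "gss_sda_nomissing.csv")
)

old_dir <- setwd(project_dir)                            # step 1: set the working directory
gss <- read.csv("gss_sda_nomissing.csv", header = TRUE)  # step 2: import the data
setwd(old_dir)                                           # optional: restore the previous directory
```

header = TRUE is the default for read.csv(); it is spelled out here because it corresponds to checking “Yes” for the Heading option in the import dialog.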

Finally, the following command will open the data in a viewer. It's a good idea to check to see that the data imported correctly.
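The command itself does not appear in the text above. In R, View() opens a data frame in the viewer, so the sketch below assumes that is what was meant. The gss data frame here is a made-up stand-in, and str() and head() are offered as console alternatives.

```r
# Made-up stand-in for the imported data frame.
gss <- data.frame(
  caseid = 1:3,
  year   = c(2010, 2010, 2012),
  age    = c(34, 51, 27),
  educ   = c(12, 16, 14)
)

# View() opens the spreadsheet-style viewer; it needs an interactive session.
if (interactive()) View(gss)

str(gss)          # variable names and types
head(gss)         # first few rows
dims <- dim(gss)  # number of rows, then number of columns
```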

Subsetting Data

As you will want only a few variables, you may want to create a subset that contains only those variables. This is easy to do in R.

In this example, I have imported the General Social Survey as a new data frame, which I have called gss. I have also identified four variables that I wish to keep: caseid, year, age, and educ.

The following command will create a new object, called data, which will be composed of my four variables, all of which are located in the gss data frame.

data <- gss[c("caseid", "year", "age", "educ")]

The basic syntax is as follows:

new_name <- old_name[c("variable1", "variable2", "variable3")]

If successful, you will have a new data frame that contains the same number of observations, but fewer variables. From this point, the new data frame will be the one to use.
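A quick way to confirm this is to compare the dimensions of the two data frames. The sketch below uses a made-up gss data frame with one extra, hypothetical variable (sex) so that there is something to drop.

```r
# Made-up stand-in for the full data frame; sex is a hypothetical extra variable.
gss <- data.frame(
  caseid = 1:3,
  year   = c(2010, 2010, 2012),
  age    = c(34, 51, 27),
  educ   = c(12, 16, 14),
  sex    = c("F", "M", "F")
)

# Keep only the four variables of interest.
data <- gss[c("caseid", "year", "age", "educ")]

same_rows  <- nrow(data) == nrow(gss)  # observations are unchanged
fewer_cols <- ncol(data) < ncol(gss)   # variables were dropped
print(names(data))
```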

A Short Guide to Codebooks

A dataset is typically accompanied by a codebook. Codebooks document the contents of the dataset. They do not usually contain more detailed information regarding the survey or the sample.

It is important to check the codebook for the variables of interest. It will tell you

  • what each code represents, and
  • how many cases are present or missing.

Sometimes, codebooks will provide the exact phrasing of the question, but please refer to the survey documentation for question phrasing and sample selection procedures.

Codebooks usually contain

  • The variable name, by which one refers to a variable in a statistics program.
  • The variable label, which describes the contents of the variable in a short form.
  • The variable data type, which is typically text-based (string variable) or a number (numeric variable).

In addition, if the data are categorical, the codebook contains

  • A list of each unique value that is contained by the variable, typically accompanied by the frequency count of each value.

If the data are continuous, the codebook contains

  • A range of values contained by the variable, along with the minimum and maximum values.

Finally, the codebook will often provide a count of missing cases and the missing value codes employed in the dataset.

The variables in the datasets selected for this class generally contain few missing codes, and all missing values have been corrected for R.
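The same kinds of summaries can be produced in R once the data are imported. This is a minimal sketch with made-up values; marital is a hypothetical categorical variable used only for illustration.

```r
# Toy data: one categorical and one continuous variable (values are made up).
toy <- data.frame(
  marital = c("married", "single", "married", NA, "divorced", "married"),
  age     = c(34, 51, 27, 45, NA, 62)
)

# Categorical: each unique value with its frequency count, including missing.
freq <- table(toy$marital, useNA = "ifany")
print(freq)

# Continuous: minimum and maximum, ignoring missing values.
age_range <- range(toy$age, na.rm = TRUE)
print(age_range)

# Count of missing cases for each variable.
n_missing <- colSums(is.na(toy))
print(n_missing)
```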

Example - General Social Survey

The following codebook entry comes from the General Social Survey and contains a frequency count (N) for each answer. Descriptions follow each item after a dash.

INCOME06 - variable name

TOTAL FAMILY INCOME - variable label

In which of these groups did your total family income, from all sources, fall last year before taxes, that is. - longer form of the interview question

- list of each unique value contained by this variable

Percent       N  Value  Label
    1.4     129      1  UNDER $1 000
    1.2     112      2  $1 000 TO 2 999
    0.9      87      3  $3 000 TO 3 999
    0.6      54      4  $4 000 TO 4 999
    0.9      83      5  $5 000 TO 5 999
    1.1     105      6  $6 000 TO 6 999
    1.4     128      7  $7 000 TO 7 999
    2.1     192      8  $8 000 TO 9 999
    3.9     362      9  $10000 TO 12499
    3.4     317     10  $12500 TO 14999
    3.3     308     11  $15000 TO 17499
    2.5     231     12  $17500 TO 19999
    3.8     348     13  $20000 TO 22499
    3.6     333     14  $22500 TO 24999
    5.2     481     15  $25000 TO 29999
    5.6     518     16  $30000 TO 34999
    5.4     493     17  $35000 TO 39999
    9.1     836     18  $40000 TO 49999
    8.0     734     19  $50000 TO 59999
    9.7     891     20  $60000 TO 74999
    7.5     693     21  $75000 TO $89999
    6.1     564     22  $90000 TO $109999
    4.1     379     23  $110000 TO $129999
    2.6     237     24  $130000 TO $149999
    6.5     595     25  $150000 OR OVER
         46,510      0  IAP
            860     26  REFUSED
            481     98  DK
  100.0  57,061         Total

Data type: numeric - variable data type

Missing-data codes: 0,26-99 - list of codes used for missing values (0 and 26 through 99)
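The datasets provided for this class have already had their missing values corrected for R. If you work with a raw file that still contains these codes, a recode along the following lines may be needed. This is a sketch under that assumption, using a short made-up vector in place of the real INCOME06 variable.

```r
# Made-up values standing in for raw INCOME06 codes.
income06 <- c(18, 0, 25, 98, 12, 26, 0, 20)

# Missing-data codes from the codebook entry: 0 and 26 through 99.
missing_codes <- c(0, 26:99)

# Replace every missing-data code with NA so R treats it as missing.
income06[income06 %in% missing_codes] <- NA

n_valid <- sum(!is.na(income06))
print(income06)
print(n_valid)
```

Without this step, codes such as 98 would be treated as real income categories and would distort means and frequency tables.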

Record/columns: 1/69-70

Properties
Data type: numeric
Missing-data codes: 0,26-99
Mean: 16.61
Std Dev: 5.74
Record/columns: 1/240-241
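The mean listed above is the mean of the category codes (1 through 25), not of dollar amounts, and it excludes the missing-data codes. As a rough check, it can be recomputed from the frequency counts in the codebook entry. The sketch below assumes the reported statistics are unweighted.

```r
# Valid category codes and their frequency counts (N) from the codebook entry.
value <- 1:25
n <- c(129, 112, 87, 54, 83, 105, 128, 192, 362, 317, 308, 231, 348,
       333, 481, 518, 493, 836, 734, 891, 693, 564, 379, 237, 595)

n_valid <- sum(n)                       # cases with a valid answer
pct <- round(100 * n / n_valid, 1)      # the Percent column
mean_code <- sum(value * n) / n_valid   # mean of the category codes

# Sample standard deviation of the codes, for comparison with the listed Std Dev.
sd_code <- sqrt(sum(n * (value - mean_code)^2) / (n_valid - 1))

print(n_valid)
print(round(mean_code, 2))
print(round(sd_code, 2))
```

The percentages are computed out of the valid cases only, which is why the IAP, REFUSED, and DK rows have no entry in the Percent column.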

Ryan Denniston


Perkins 233
PO Box 90175
Durham, NC 27708