All in One View

Content from The Case for Switching


Last updated on 2026-04-22

Estimated time: 45 minutes

Overview

Questions

  • Why should I switch from SPSS to R?
  • What can R do that SPSS cannot?
  • How much does SPSS actually cost compared to R?

Objectives

  • Describe the practical advantages of R over SPSS for research
  • See live examples of R capabilities that go beyond SPSS
  • Understand the cost and reproducibility arguments for switching
Cartoon of a researcher at a Caribbean crossroads choosing between a cracked SPSS path and a paved R path leading to the coast
One road costs you a license fee. The other one costs you a learning curve.

Introduction


This episode is a motivational opening. You will not write any code yourself yet — sit back and watch the instructor demonstrate what R can do. By the end, you should have a clear picture of why learning R is worth the investment of your time.

What you will be able to produce

By Friday afternoon you will be able to take a single R Markdown file and produce a polished, multi-page PDF report from it with one command. The worked example we will return to is a baseball statistics report on Xander Bogaerts, Aruba’s MLB star — the kind of document that would ordinarily take an afternoon of copy-paste between Excel, SPSS output, and Word, but that here is one file and one click. You will see the finished report in this opening so you know what you are aiming at. In Episode 6 you will build a smaller version of the same thing yourself.
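
The "one command" is literal. As a preview only (you will not type this until Episode 6), rendering an R Markdown file to PDF is a single function call; the file name here is a placeholder for illustration:

```r
# Render a hypothetical R Markdown file to a PDF report in one call
rmarkdown::render("bogaerts_report.Rmd", output_format = "pdf_document")
```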

What you will see

The instructor will demonstrate four things that are impossible or impractical in SPSS:

  1. Pulling a live research dataset — the DCDC Network’s small-island reference list, maintained on GitHub at the University of Aruba — straight into R, no browser required.
  2. Creating a publication-quality chart in under 10 lines of code.
  3. A reproducible report that updates automatically when new data arrives.
  4. An interactive dashboard built entirely in R.

If any of those sound appealing, you are in the right place.

The cost argument


Let us start with the most concrete reason. SPSS is expensive — especially for small island institutions that pay per seat.

                          SPSS Standard                        R + RStudio
License type              Annual subscription                  Free, open-source
Cost per user per year    USD 1,170 – 5,730 (varies by tier)   USD 0
5-year cost for 5 users   USD 29,250 – 143,250                 USD 0
Runs on                   Windows, Mac                         Windows, Mac, Linux, cloud
Updates                   Paid upgrades                        Continuous, free

For a university department in the Dutch Caribbean with three SPSS licenses, that is easily AWG 10,000+ per year that could be redirected to research funding, student assistants, or conference travel.

Callout

“But my institution already pays for SPSS”

That is true today. But institutional budgets change, and when you graduate or change jobs, your personal SPSS license disappears. R stays with you forever — on your laptop, on a cloud server, on a Raspberry Pi if you want. Your scripts will still run in 10 years.

What R gives you that SPSS does not


Reproducibility

In SPSS, a typical workflow looks like this: open a dataset, click through menus, copy output into Word, repeat. If your supervisor asks “can you re-run this with the updated data?”, you have to remember every click.

In R, your entire analysis lives in a script. You change one line (the file path) and re-run. Every step is documented.
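
As a sketch of what such a script can look like (the file name and the score column are hypothetical, for illustration only):

```r
library(tidyverse)

# The only line that changes when updated data arrives:
survey <- read_csv("data/survey_2024.csv")

# Every later step re-runs identically, in order, with no clicks to remember
survey_clean <- survey |> filter(!is.na(score))
mean(survey_clean$score)
```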

Packages

SPSS has a fixed set of procedures. R has over 20,000 add-on packages on CRAN alone, covering everything from Bayesian statistics to text mining to geographic mapping. If a method exists, there is probably an R package for it.

Automation

Need to run the same analysis on 50 files? In SPSS, that means 50 times through the menus (or learning SPSS syntax, which few people do). In R, it is a three-line loop.
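
A sketch of that loop, assuming a folder data/waves/ of similarly structured CSV files (the folder name is an assumption for illustration):

```r
library(tidyverse)

# Find every CSV in the folder, read each one, and stack them into one table
files <- list.files("data/waves", pattern = "\\.csv$", full.names = TRUE)
all_waves <- map(files, read_csv) |> list_rbind()
```

Point the first line at a different folder and the same three lines process 5 files or 500.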

Communication

R Markdown and Quarto let you combine narrative text, code, and output into a single document — a PDF, a Word file, a website, or a slideshow. This lesson itself was built with R.
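
The anatomy of such a file is worth a glance now, even before you can write one. A minimal R Markdown source is ordinary text plus a small metadata header (the title below is a placeholder):

```markdown
---
title: "Quarterly Visitor Report"
output: pdf_document
---

Narrative text goes here. Code chunks sit between the paragraphs,
and their output (tables, charts) is inserted automatically when
the document is rendered.
```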

Career value

Data science job postings almost never list SPSS. R and Python dominate. Even within academia, journals increasingly expect reproducible code alongside submissions.

Live demonstration


The instructor will now run a live demonstration. Watch the screen.

Callout

What is happening on screen

Do not worry about understanding the code right now. The goal is to see what is possible. You will learn the building blocks starting in the next episode.

This is the complete script to run live. Practice this before the workshop. Make sure tidyverse is installed — everything we need for the demo comes with it.

Step 1: Frame the source before you type

Before the first keystroke, name what the room is about to see. The CSV about to load is in a GitHub repository maintained at the University of Aruba — island-research-reference-data — part of the DCDC Network’s shared infrastructure for island research. It is not a third-party service you hope stays up. It is research data the network owns and curates. That framing matters: the “wow” is not just that R can read a URL. It is that the data layer underneath belongs to us.

Step 2: Pull the SIDS reference list

Open a new R script in RStudio and type (or paste) the following. Run it line by line so participants can watch each step.

R

# Load packages (install tidyverse once, before the workshop)
# install.packages("tidyverse")
library(tidyverse)

# Pull the UA island-research reference list straight from GitHub
countries <- read_csv(
  "https://raw.githubusercontent.com/University-of-Aruba/island-research-reference-data/main/countries/countries_reference_xlsform.csv"
)

# Quick look at what we got
head(countries)

Pause. Point out: “No browser. No download dialog. No save-as. The file is now a live object in my session, with over a dozen columns per country.”

Step 3: Filter to SIDS and chart by region

R

countries |>
  filter(is_sids == 1) |>
  count(wb_region) |>
  ggplot(aes(x = reorder(wb_region, n), y = n)) +
  geom_col(fill = "#44759e") +
  coord_flip() +
  labs(
    title = "Small island developing states by World Bank region",
    x = NULL,
    y = "Number of SIDS"
  ) +
  theme_minimal(base_size = 14)

Pause again. Key talking points:

  • “This chart is ready for a report as-is. Title, axis labels, colour, proportions — all set in code.”
  • “If the UA team adds a country to the reference list tomorrow, I re-run this script and the chart updates. No re-click, no re-export.”
  • “Every editorial choice — what counts as a SIDS, which region goes where — is traceable, because the definitions sit in the source CSV you just pulled.”

Step 4: Show the contrast with SPSS

Ask the audience: “How would you have done this in SPSS?”

Walk through it slowly. Make it sting a little — this is the moment the cost of the current workflow lands.

  1. Go looking for a canonical SIDS list. UN-OHRLLS? UN DESA? A supplementary table from a recent paper? Pick one and hope it is current.
  2. If you cannot find a clean download, email a colleague who might have one saved somewhere. Wait for a reply. Half a day, on a good day.
  3. Open the CSV you eventually receive. Discover that “Cabo Verde” and “Cape Verde” are different strings, that some country codes are ISO-2 and others ISO-3, that one row has a stray trailing comma. Clean by hand.
  4. Import the cleaned file into SPSS. Recode the region variable because whatever classification the file uses does not match World Bank regions.
  5. Analyze > Descriptive Statistics > Frequencies on region. Copy the output table.
  6. Graphs > Chart Builder, drag variables, format the chart, copy, paste into Word.

Then say: “That is half a morning, on a good day, assuming the colleague replies and the file is clean. In R it was six lines and ten seconds — from a dataset the DCDC Network maintains, so the next person who needs it gets the same clean answer.”

Backup plan

If the Wi-Fi is unreliable, the same CSV is saved locally at episodes/data/countries_backup.csv. Swap the read_csv() call for the local path:

R

countries <- read_csv("data/countries_backup.csv")

Then proceed with the filter() |> count() |> ggplot() pipeline as normal.

Summary


You have now seen R:

  • Pull live data from the internet with a single function call
  • Create a publication-ready chart in 10 lines of code
  • Do both of these things in a way that is fully reproducible

Starting in the next episode, you will learn to do these things yourself — one step at a time.

Key Points
  • R is free, open-source, and runs on any operating system
  • R scripts make your analysis fully reproducible
  • R can pull data from APIs, create interactive visualizations, and automate reports — things SPSS cannot do
  • Switching builds on your existing statistical knowledge, not replaces it

Content from Your First R Session


Last updated on 2026-04-22

Estimated time: 90 minutes

Overview

Questions

  • How does RStudio compare to the SPSS interface?
  • How do I import data, including SPSS .sav files?
  • How do I get descriptive statistics and frequency tables in R?

Objectives

  • Navigate the RStudio interface and identify the equivalent of SPSS panels
  • Import CSV and SPSS .sav files into R
  • Run basic descriptive statistics and frequency tables
  • Inspect variables and data structure
Cartoon of a researcher at a Caribbean beach bar opening his laptop to the R console, with an iguana watching from the counter
The iguana is optional. The coconut water is not.

RStudio orientation


When you open RStudio for the first time, you see four panes. If you have used SPSS before, each one has a rough equivalent:

RStudio pane           Location       SPSS equivalent    What it does
Source Editor          Top-left       Syntax Editor      Where you write and save your code (scripts)
Console                Bottom-left    Output Viewer      Where R runs commands and prints results
Environment            Top-right      Data View header   Lists all objects (datasets, values) currently in memory
Files / Plots / Help   Bottom-right   (no equivalent)    File browser, plot preview, and built-in documentation

The key difference from SPSS: in SPSS, you usually have one dataset open at a time and interact through menus. In RStudio, you write instructions in the Source Editor (top-left), send them to the Console (bottom-left), and the results appear either in the Console or the Plots pane.

Callout

The Source Editor is your new best friend

In SPSS, many users never open the Syntax Editor — they click menus instead. In R, the Source Editor is how you work. Think of it as a recipe: you write the steps once, and you (or anyone else) can re-run them at any time.

Save your scripts with the .R extension. Three reasons you will thank yourself later:

  • RStudio recognises .R files and turns on syntax highlighting, error checking, and the “Run” button.
  • Version control systems like Git track changes line by line in .R files but treat other formats as opaque blobs.
  • When a colleague opens the file in six months, the extension tells them immediately that this is R code, not a Word document or a loose text file.

The extension is small. The habit pays for itself the first time you come back to your own work.

Objects and assignment


In SPSS, when you compute a new variable, it appears as a column in your dataset. In R, everything you create is stored as a named object.

You create objects with the assignment operator <- (a less-than sign followed by a hyphen). Read it as “gets” or “is assigned”.

R

# Store a number
population <- 106739

# Store text (called a "character string" in R)
island <- "Aruba"

# Store the result of a calculation
density <- population / 180  # Aruba is about 180 km²

To see the value of an object, type its name and run it:

R

population

OUTPUT

[1] 106739

R

island

OUTPUT

[1] "Aruba"

R

density

OUTPUT

[1] 592.9944

Callout

Why <- and not =?

You will see some people use = for assignment, and it works in most cases. However, the R community convention is <-. It makes your code easier to read because = is also used inside function arguments (as you will see shortly).

In RStudio, the keyboard shortcut Alt + - (Alt and the minus key) types <- for you automatically.

Functions: R’s version of menu clicks


In SPSS, you click Analyze > Descriptive Statistics > Descriptives and a dialog box appears. In R, you call a function instead. A function has a name, and you pass it arguments inside parentheses.

R

# round() is a function. 592.777 is the input, digits = 1 is an option.
round(592.777, digits = 1)

OUTPUT

[1] 592.8

R

# sqrt() calculates a square root
sqrt(density)

OUTPUT

[1] 24.35148

The pattern is always: function_name(argument1, argument2, ...). This is the R equivalent of filling in an SPSS dialog box — the function name is the menu item, and the arguments are the fields you would fill in.

Packages: extending R


R comes with many built-in functions, but its real power comes from packages — add-on libraries written by other users. Think of them as SPSS modules, except they are free.

There are two steps:

  1. Install the package (once per computer, like installing an app):

R

install.packages("tidyverse")

  2. Load the package (once per session, like opening an app):

R

library(tidyverse)

OUTPUT

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Callout

install.packages() vs library()

A common source of confusion for beginners:

  • install.packages("tidyverse") — downloads and installs the package. You only need to do this once (or when you want to update). Note the quotation marks.
  • library(tidyverse) — loads a package that is already installed so you can use it in your current session. You do this every time you start R. No quotation marks needed (though they work too).

Analogy: install.packages() is buying a book and putting it on your shelf. library() is taking the book off the shelf and opening it.

Importing data


Before you import — set up your workshop folder

R can read a file from the internet, and we saw that in Episode 1. In daily work you will more often read from a file that already lives on your computer — on a shared drive, in a project folder, next to your script. We will do that here.

Three short steps, and then every read_csv() line in the rest of the course will just work.

1. Download the two course datasets. Later in this episode we compare loading the same data from a CSV file and from an Excel file. Download both now. Open each link in your browser and click the Download raw file button near the top right of the preview:

Do not open the CSV in Excel and re-save — that can silently change the encoding. Just save both files as they are.

2. Create an RStudio project. In RStudio, go to File → New Project → New Directory → New Project. Name the directory r-workshop and save it somewhere you can find again (your Documents folder or Desktop is fine). RStudio will open a fresh session with this folder as its working directory.

3. Put the files where R expects to find them. Inside your r-workshop project folder, create a subfolder called data (lower-case, no spaces). Move the downloaded aruba_visitors.csv and aruba_visitors.xlsx into it. Your structure should look like this:

r-workshop/
├── r-workshop.Rproj
└── data/
    ├── aruba_visitors.csv
    └── aruba_visitors.xlsx

In RStudio’s Files pane (bottom-right), click into the data folder. If you see aruba_visitors.csv, you are ready. Green sticky note.

Callout

Why a project folder?

A project folder answers the single most common beginner error in R: “R cannot find my file.” The file path "data/aruba_visitors.csv" is read relative to R’s current working directory. When you open a project, RStudio automatically sets the working directory to the project folder, so the path works. Without a project, R’s working directory could be anywhere — usually somewhere unhelpful like your Documents folder — and the file is not found.

Projects also keep scripts, data, and outputs organised in one place you can hand to a colleague or archive at the end of an engagement.
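
When the "cannot find my file" error does appear, two base R calls diagnose it. This is a sketch; the second call returns FALSE whenever the path does not resolve from the current working directory:

```r
getwd()                                  # the folder R resolves relative paths against
file.exists("data/aruba_visitors.csv")   # TRUE only if the path resolves from there
```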

CSV files with read_csv()

The most common data format in R is CSV (comma-separated values). The readr package (loaded as part of tidyverse) provides read_csv():

R

visitors <- read_csv("data/aruba_visitors.csv")

OUTPUT

Rows: 120 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): quarter, origin
dbl (7): year, visitors_stayover, visitors_cruise, avg_stay_nights, avg_spen...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

R prints a summary of the column types it detected. This is equivalent to opening a CSV in SPSS via File > Open > Data and checking the variable types.

SPSS .sav files with haven

If you have existing SPSS datasets, the haven package reads them directly — including variable labels and value labels:

R

# install.packages("haven")  # run once if needed
library(haven)
spss_data <- read_sav("path/to/your/file.sav")

This means you do not have to convert your SPSS files to CSV first. R reads them as-is.
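
One follow-up detail: SPSS value labels arrive as haven's "labelled" vectors, not as factors. If you prefer ordinary R factors (the closest native equivalent), haven's as_factor() converts the labelled columns; the file path below is a placeholder:

```r
library(haven)

spss_data <- read_sav("path/to/your/file.sav")  # placeholder path
spss_data <- as_factor(spss_data)               # labelled columns become factors
```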

Excel files with readxl

Many datasets arrive as Excel files (.xlsx or .xls), especially from government agencies and international organizations. In SPSS, you would import these through File > Open > Data and select the Excel file type from the dropdown. In R, the readxl package handles this.

R

# Install once if needed
install.packages("readxl")

R

library(readxl)

# Basic import -- reads the first sheet by default
visitors_xl <- read_excel("data/aruba_visitors.xlsx")

If your Excel file has multiple sheets, use the sheet argument to specify which one you want – either by name or by position:

R

# By sheet name
stayover <- read_excel("data/aruba_visitors.xlsx", sheet = "stayover")

# By position (second sheet)
cruise <- read_excel("data/aruba_visitors.xlsx", sheet = 2)

You can also read a specific cell range with the range argument, which is useful when the data does not start at cell A1:

R

# Read only cells B2 through F50
subset <- read_excel("data/aruba_visitors.xlsx", range = "B2:F50")

Callout

read_excel() vs read_csv() – when to use which

If you have a choice, CSV is simpler: it is plain text, lightweight, and avoids formatting surprises. Use read_excel() when you receive data in Excel format and do not want to manually export it to CSV first – or when the file contains multiple sheets you need to access programmatically.

Unlike read_csv(), read_excel() is not part of the tidyverse. You need to install and load readxl separately.

Challenge

Challenge: Import from Excel

Suppose you received an Excel file called aruba_visitors.xlsx with two sheets: “stayover” and “cruise”.

  1. Write the code to load the readxl package.
  2. Write the code to read the “cruise” sheet into an object called cruise_data.
  3. How would you check how many rows and columns cruise_data has?

R

# 1: Load the package
library(readxl)

# 2: Read the cruise sheet
cruise_data <- read_excel("aruba_visitors.xlsx", sheet = "cruise")

# 3: Check dimensions
dim(cruise_data)
# Or: glimpse(cruise_data)

Exploring your data


Now that we have the visitors dataset loaded, let us explore it. Each of the functions below is the R equivalent of something you would do in SPSS.

View() — the Data View equivalent

R

View(visitors)

This opens a spreadsheet-like viewer in RStudio, just like SPSS Data View. You can scroll, sort columns by clicking headers, and filter. (Note the capital V.)

head() — see the first few rows

R

head(visitors)

OUTPUT

# A tibble: 6 × 9
   year quarter origin        visitors_stayover visitors_cruise avg_stay_nights
  <dbl> <chr>   <chr>                     <dbl>           <dbl>           <dbl>
1  2019 Q1      United States             72450           45200             6.8
2  2019 Q1      Netherlands               18300            1200            10.2
3  2019 Q1      Venezuela                  8200             400             5.1
4  2019 Q1      Colombia                   6100             800             4.8
5  2019 Q1      Canada                     4500            3200             7.1
6  2019 Q1      Other                      9800            5600             5.5
# ℹ 3 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
#   satisfaction_score <dbl>

This is faster than View() when you just want a quick look. By default it shows 6 rows. You can change that: head(visitors, n = 10).

str() — the Variable View equivalent

In SPSS, you would switch to Variable View to see variable names, types, and labels. In R, str() does the same thing:

R

str(visitors)

OUTPUT

spc_tbl_ [120 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ year               : num [1:120] 2019 2019 2019 2019 2019 ...
 $ quarter            : chr [1:120] "Q1" "Q1" "Q1" "Q1" ...
 $ origin             : chr [1:120] "United States" "Netherlands" "Venezuela" "Colombia" ...
 $ visitors_stayover  : num [1:120] 72450 18300 8200 6100 4500 ...
 $ visitors_cruise    : num [1:120] 45200 1200 400 800 3200 5600 38100 900 300 700 ...
 $ avg_stay_nights    : num [1:120] 6.8 10.2 5.1 4.8 7.1 5.5 6.5 9.8 4.9 4.6 ...
 $ avg_spending_usd   : num [1:120] 1250 980 620 710 1180 890 1220 960 600 690 ...
 $ hotel_occupancy_pct: num [1:120] 82.3 82.3 82.3 82.3 82.3 82.3 78.1 78.1 78.1 78.1 ...
 $ satisfaction_score : num [1:120] 8.1 7.9 7.5 7.7 8 7.6 8 7.8 7.4 7.6 ...
 - attr(*, "spec")=
  .. cols(
  ..   year = col_double(),
  ..   quarter = col_character(),
  ..   origin = col_character(),
  ..   visitors_stayover = col_double(),
  ..   visitors_cruise = col_double(),
  ..   avg_stay_nights = col_double(),
  ..   avg_spending_usd = col_double(),
  ..   hotel_occupancy_pct = col_double(),
  ..   satisfaction_score = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

This tells you: how many observations (rows), how many variables (columns), and the type of each variable (num for numbers, chr for text).

glimpse() — a tidyverse alternative to str()

The glimpse() function from dplyr gives similar information in a tidier format:

R

glimpse(visitors)

OUTPUT

Rows: 120
Columns: 9
$ year                <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20…
$ quarter             <chr> "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q2", "Q2", "Q…
$ origin              <chr> "United States", "Netherlands", "Venezuela", "Colo…
$ visitors_stayover   <dbl> 72450, 18300, 8200, 6100, 4500, 9800, 65200, 15400…
$ visitors_cruise     <dbl> 45200, 1200, 400, 800, 3200, 5600, 38100, 900, 300…
$ avg_stay_nights     <dbl> 6.8, 10.2, 5.1, 4.8, 7.1, 5.5, 6.5, 9.8, 4.9, 4.6,…
$ avg_spending_usd    <dbl> 1250, 980, 620, 710, 1180, 890, 1220, 960, 600, 69…
$ hotel_occupancy_pct <dbl> 82.3, 82.3, 82.3, 82.3, 82.3, 82.3, 78.1, 78.1, 78…
$ satisfaction_score  <dbl> 8.1, 7.9, 7.5, 7.7, 8.0, 7.6, 8.0, 7.8, 7.4, 7.6, …

summary() — Descriptives in one command

In SPSS: Analyze > Descriptive Statistics > Descriptives. In R:

R

summary(visitors)

OUTPUT

      year        quarter             origin          visitors_stayover
 Min.   :2019   Length:120         Length:120         Min.   :  180
 1st Qu.:2020   Class :character   Class :character   1st Qu.: 4050
 Median :2021   Mode  :character   Mode  :character   Median : 5900
 Mean   :2021                                         Mean   :15948
 3rd Qu.:2022                                         3rd Qu.:17875
 Max.   :2023                                         Max.   :78200
 visitors_cruise   avg_stay_nights  avg_spending_usd hotel_occupancy_pct
 Min.   :    0.0   Min.   : 3.500   Min.   : 400.0   Min.   :12.10
 1st Qu.:  242.5   1st Qu.: 4.575   1st Qu.: 677.5   1st Qu.:70.72
 Median : 1100.0   Median : 5.650   Median : 900.0   Median :77.95
 Mean   : 6624.8   Mean   : 6.220   Mean   : 898.2   Mean   :71.61
 3rd Qu.: 4425.0   3rd Qu.: 6.900   3rd Qu.:1142.5   3rd Qu.:81.47
 Max.   :48900.0   Max.   :10.700   Max.   :1310.0   Max.   :86.40
 satisfaction_score
 Min.   :6.50
 1st Qu.:7.30
 Median :7.65
 Mean   :7.63
 3rd Qu.:8.00
 Max.   :8.40      

For numeric columns, you get the minimum, maximum, mean, median, and quartiles. For character columns, you get the length and type.

table() — Frequency tables

In SPSS: Analyze > Descriptive Statistics > Frequencies. In R:

R

table(visitors$origin)

OUTPUT


       Canada      Colombia   Netherlands         Other United States
           20            20            20            20            20
    Venezuela
           20 

The $ operator extracts a single column from a data frame. So visitors$origin means “the origin column from the visitors dataset” — like clicking on a single variable in SPSS.

You can also make two-way frequency tables:

R

table(visitors$year, visitors$origin)

OUTPUT


       Canada Colombia Netherlands Other United States Venezuela
  2019      4        4           4     4             4         4
  2020      4        4           4     4             4         4
  2021      4        4           4     4             4         4
  2022      4        4           4     4             4         4
  2023      4        4           4     4             4         4

Instructor notes

  • Spend time on the RStudio pane orientation. Have participants identify each pane on their own screen before moving on.
  • The <- assignment operator trips people up. Give them a few minutes to practice creating objects with different names and values.
  • When loading tidyverse, the startup messages can be alarming to beginners. Reassure them that the “Attaching packages” and “Conflicts” messages are normal and expected.
  • Do the “Before you import — set up your workshop folder” subsection as a whole-room moment, not as reading. Project the CSV download page, walk everyone through the browser download, the New Project dialog, and creating the data subfolder. Wait for green sticky notes in the Files pane before typing read_csv(). This is the most common point of failure in the course and it is worth five deliberate minutes up front to avoid twenty scattered minutes of troubleshooting later.
  • If a participant cannot download (blocked network, locked laptop), have a USB stick or shared-drive copy of aruba_visitors.csv ready as fallback.
Challenge

Challenge 1: Explore the Aruba visitors dataset

Import the Aruba visitors dataset and answer the following questions using R functions. Write your code in the Source Editor and run each line.

  1. How many rows and how many columns does the dataset have?
  2. What data type is the origin column? What data type is visitors_stayover?
  3. What is the mean avg_spending_usd across all rows?
  4. How many rows are there for each origin country?

R

# Load the data (if not already loaded)
library(tidyverse)
visitors <- read_csv("data/aruba_visitors.csv")

OUTPUT

Rows: 120 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): quarter, origin
dbl (7): year, visitors_stayover, visitors_cruise, avg_stay_nights, avg_spen...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Question 1: How many rows and columns?

R

# Either of these works:
dim(visitors)

OUTPUT

[1] 120   9

R

glimpse(visitors)

OUTPUT

Rows: 120
Columns: 9
$ year                <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20…
$ quarter             <chr> "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q2", "Q2", "Q…
$ origin              <chr> "United States", "Netherlands", "Venezuela", "Colo…
$ visitors_stayover   <dbl> 72450, 18300, 8200, 6100, 4500, 9800, 65200, 15400…
$ visitors_cruise     <dbl> 45200, 1200, 400, 800, 3200, 5600, 38100, 900, 300…
$ avg_stay_nights     <dbl> 6.8, 10.2, 5.1, 4.8, 7.1, 5.5, 6.5, 9.8, 4.9, 4.6,…
$ avg_spending_usd    <dbl> 1250, 980, 620, 710, 1180, 890, 1220, 960, 600, 69…
$ hotel_occupancy_pct <dbl> 82.3, 82.3, 82.3, 82.3, 82.3, 82.3, 78.1, 78.1, 78…
$ satisfaction_score  <dbl> 8.1, 7.9, 7.5, 7.7, 8.0, 7.6, 8.0, 7.8, 7.4, 7.6, …

The dataset has 120 rows and 9 columns (5 years x 4 quarters x 6 origin countries = 120 rows).

Question 2: Data types?

R

str(visitors)

OUTPUT

spc_tbl_ [120 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ year               : num [1:120] 2019 2019 2019 2019 2019 ...
 $ quarter            : chr [1:120] "Q1" "Q1" "Q1" "Q1" ...
 $ origin             : chr [1:120] "United States" "Netherlands" "Venezuela" "Colombia" ...
 $ visitors_stayover  : num [1:120] 72450 18300 8200 6100 4500 ...
 $ visitors_cruise    : num [1:120] 45200 1200 400 800 3200 5600 38100 900 300 700 ...
 $ avg_stay_nights    : num [1:120] 6.8 10.2 5.1 4.8 7.1 5.5 6.5 9.8 4.9 4.6 ...
 $ avg_spending_usd   : num [1:120] 1250 980 620 710 1180 890 1220 960 600 690 ...
 $ hotel_occupancy_pct: num [1:120] 82.3 82.3 82.3 82.3 82.3 82.3 78.1 78.1 78.1 78.1 ...
 $ satisfaction_score : num [1:120] 8.1 7.9 7.5 7.7 8 7.6 8 7.8 7.4 7.6 ...
 - attr(*, "spec")=
  .. cols(
  ..   year = col_double(),
  ..   quarter = col_character(),
  ..   origin = col_character(),
  ..   visitors_stayover = col_double(),
  ..   visitors_cruise = col_double(),
  ..   avg_stay_nights = col_double(),
  ..   avg_spending_usd = col_double(),
  ..   hotel_occupancy_pct = col_double(),
  ..   satisfaction_score = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

origin is character (chr), visitors_stayover is numeric (num).

Question 3: Mean average spending?

R

summary(visitors$avg_spending_usd)

OUTPUT

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  400.0   677.5   900.0   898.2  1142.5  1310.0 

The mean avg_spending_usd is shown in the summary output. You can also get just the mean with:

R

mean(visitors$avg_spending_usd)

OUTPUT

[1] 898.1667

Question 4: Rows per origin country?

R

table(visitors$origin)

OUTPUT


       Canada      Colombia   Netherlands         Other United States
           20            20            20            20            20
    Venezuela
           20 

Each origin country has 20 rows (4 quarters x 5 years).

Challenge

Challenge 2: Practice with objects and functions

  1. Create an object called my_island that stores the text "Aruba".
  2. Create an object called area_km2 that stores the value 180.
  3. Use the nchar() function to count the number of characters in my_island.
  4. Use round() to round the mean of avg_spending_usd to the nearest whole number. (Hint: you can put one function inside another.)

R

# 1 and 2: Create objects
my_island <- "Aruba"
area_km2 <- 180

# 3: Count characters
nchar(my_island)

OUTPUT

[1] 5

R

# 4: Round the mean spending
round(mean(visitors$avg_spending_usd), digits = 0)

OUTPUT

[1] 898

Nesting functions (putting one inside another) is common in R. R evaluates from the inside out: first it calculates mean(visitors$avg_spending_usd), then it passes that result to round().

Summary


You have now completed your first hands-on R session. You can:

  • Find your way around RStudio
  • Create objects and use functions
  • Install and load packages
  • Import a CSV file
  • Inspect your data with View(), head(), str(), glimpse(), summary(), and table()

In SPSS terms, you have learned the equivalent of opening a dataset, switching between Data View and Variable View, and running Descriptives and Frequencies. The difference is that everything you did is saved in a script that you can re-run at any time.

Key Points
  • RStudio is your workspace — it combines a script editor, console, and data viewer
  • haven::read_sav() imports SPSS files directly, preserving labels
  • summary(), table(), and str() replace the Descriptives and Frequencies menus in SPSS

Content from Data Manipulation


Last updated on 2026-04-22 | Edit this page

Estimated time: 90 minutes

Overview

Questions

  • How do I filter, sort, and recode data in R the way I do in SPSS?
  • What is the tidyverse and why does it matter?
  • How do I create new variables from existing ones?

Objectives

  • Filter rows, select columns, and sort data using dplyr
  • Create new variables and recode existing ones
  • Chain operations together using the pipe operator
  • Recognize the SPSS menu equivalent for each operation
Cartoon of a researcher as a Caribbean chef with jars labeled filter(), select(), and mutate(), cooking a pot of tidy data
You can’t cook without ingredients. You can’t wrangle without verbs.

The tidyverse approach


In SPSS, you manipulate data through menus: Data > Select Cases, Data > Sort Cases, Transform > Compute Variable, and so on. In R, the dplyr package gives you a set of verbs — functions with plain-English names that do exactly what they say.

The dplyr package is part of the tidyverse, which you already loaded in the previous episode. If you are starting a fresh R session, load it now:

R

library(tidyverse)

OUTPUT

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

R

# Also load our dataset
visitors <- read_csv("data/aruba_visitors.csv")

OUTPUT

Rows: 120 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): quarter, origin
dbl (7): year, visitors_stayover, visitors_cruise, avg_stay_nights, avg_spen...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Callout

What is a dependency?

A package in R is a bundle of code that someone else wrote so you do not have to. tidyverse, for example, is actually a collection of packages that work together for data manipulation and visualisation. When you use library(tidyverse), R loads those packages into your session so their functions become available.

A dependency is a package that another package needs in order to work. tidyverse depends on dplyr, ggplot2, readr, and several others. When you run install.packages("tidyverse"), R automatically installs everything it depends on too. You do not have to manage the chain manually.

Two things this means in practice. First, installing a package for the first time can take a minute or two, because R is pulling down the whole dependency chain. That is normal. Second, when you share your script with a colleague and they get an error like there is no package called 'dplyr', the fix is almost always install.packages("tidyverse"), not install.packages("dplyr"), because the dependency lives inside the larger package.

We come back to this in Episode 6 (reproducible reporting), where recording which packages your script needs is part of making sure it still runs six months from now.

Here is the key idea: every SPSS menu operation you use for data manipulation has a dplyr verb equivalent.

SPSS menu path dplyr verb What it does
Data > Select Cases filter() Keep rows that match a condition
(selecting columns in Variable View) select() Keep or drop columns
Transform > Compute Variable mutate() Create or modify a column
Transform > Recode into Different Variables case_when() Assign values based on conditions
Data > Sort Cases arrange() Sort rows
Data > Split File + Aggregate group_by() + summarise() Calculate summaries by group

Let us work through each one.

filter() — Select Cases


In SPSS, you would go to Data > Select Cases, click “If condition is satisfied”, and type a condition like origin = "United States". In R:

R

us_visitors <- filter(visitors, origin == "United States")
us_visitors

OUTPUT

# A tibble: 20 × 9
    year quarter origin        visitors_stayover visitors_cruise avg_stay_nights
   <dbl> <chr>   <chr>                     <dbl>           <dbl>           <dbl>
 1  2019 Q1      United States             72450           45200             6.8
 2  2019 Q2      United States             65200           38100             6.5
 3  2019 Q3      United States             58900           32400             6.2
 4  2019 Q4      United States             69800           42800             6.7
 5  2020 Q1      United States             68100           40200             6.6
 6  2020 Q2      United States              4200               0             5.8
 7  2020 Q3      United States             18500               0             6
 8  2020 Q4      United States             42100            8200             6.3
 9  2021 Q1      United States             48200           12400             6.4
10  2021 Q2      United States             52800           18600             6.5
11  2021 Q3      United States             55400           25200             6.3
12  2021 Q4      United States             64200           38500             6.6
13  2022 Q1      United States             74800           46800             6.9
14  2022 Q2      United States             67500           39200             6.6
15  2022 Q3      United States             61200           34600             6.3
16  2022 Q4      United States             72100           44500             6.8
17  2023 Q1      United States             78200           48900             7
18  2023 Q2      United States             70100           41200             6.7
19  2023 Q3      United States             63500           36200             6.4
20  2023 Q4      United States             75600           46100             6.9
# ℹ 3 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
#   satisfaction_score <dbl>

Notice the double equals sign ==. This is how R tests equality. A single = is for assigning values to function arguments (like digits = 1); a double == asks “is this equal to?”.

You can combine conditions:

R

# US visitors in 2023 only
us_2023 <- filter(visitors, origin == "United States", year == 2023)
us_2023

OUTPUT

# A tibble: 4 × 9
   year quarter origin        visitors_stayover visitors_cruise avg_stay_nights
  <dbl> <chr>   <chr>                     <dbl>           <dbl>           <dbl>
1  2023 Q1      United States             78200           48900             7
2  2023 Q2      United States             70100           41200             6.7
3  2023 Q3      United States             63500           36200             6.4
4  2023 Q4      United States             75600           46100             6.9
# ℹ 3 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
#   satisfaction_score <dbl>

R

# Rows where stayover visitors exceeded 50,000
big_quarters <- filter(visitors, visitors_stayover > 50000)
big_quarters

OUTPUT

# A tibble: 16 × 9
    year quarter origin        visitors_stayover visitors_cruise avg_stay_nights
   <dbl> <chr>   <chr>                     <dbl>           <dbl>           <dbl>
 1  2019 Q1      United States             72450           45200             6.8
 2  2019 Q2      United States             65200           38100             6.5
 3  2019 Q3      United States             58900           32400             6.2
 4  2019 Q4      United States             69800           42800             6.7
 5  2020 Q1      United States             68100           40200             6.6
 6  2021 Q2      United States             52800           18600             6.5
 7  2021 Q3      United States             55400           25200             6.3
 8  2021 Q4      United States             64200           38500             6.6
 9  2022 Q1      United States             74800           46800             6.9
10  2022 Q2      United States             67500           39200             6.6
11  2022 Q3      United States             61200           34600             6.3
12  2022 Q4      United States             72100           44500             6.8
13  2023 Q1      United States             78200           48900             7
14  2023 Q2      United States             70100           41200             6.7
15  2023 Q3      United States             63500           36200             6.4
16  2023 Q4      United States             75600           46100             6.9
# ℹ 3 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
#   satisfaction_score <dbl>
Callout

Common comparison operators

Operator Meaning Example
== equal to origin == "Canada"
!= not equal to origin != "Other"
> greater than visitors_stayover > 10000
< less than avg_stay_nights < 5
>= greater than or equal to year >= 2021
<= less than or equal to satisfaction_score <= 7.5
%in% matches one of several values origin %in% c("United States", "Canada")
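The last operator in the table, %in%, replaces a chain of == tests joined with | ("or"). A short sketch on the visitors data (north_america is a name we choose here):

```r
# Keep rows from either of two origin countries
north_america <- filter(visitors,
                        origin %in% c("United States", "Canada"))

# Equivalent, but longer:
# filter(visitors, origin == "United States" | origin == "Canada")
```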

select() — Keep or drop columns


Sometimes your dataset has more columns than you need. In SPSS, you might delete variables or simply ignore them. In R, select() lets you keep only the columns you want:

R

# Keep only year, quarter, origin, and stayover visitors
slim <- select(visitors, year, quarter, origin, visitors_stayover)
head(slim)

OUTPUT

# A tibble: 6 × 4
   year quarter origin        visitors_stayover
  <dbl> <chr>   <chr>                     <dbl>
1  2019 Q1      United States             72450
2  2019 Q1      Netherlands               18300
3  2019 Q1      Venezuela                  8200
4  2019 Q1      Colombia                   6100
5  2019 Q1      Canada                     4500
6  2019 Q1      Other                      9800

You can also drop columns by putting a minus sign in front:

R

# Drop the satisfaction score column
no_satisfaction <- select(visitors, -satisfaction_score)
head(no_satisfaction)

OUTPUT

# A tibble: 6 × 8
   year quarter origin        visitors_stayover visitors_cruise avg_stay_nights
  <dbl> <chr>   <chr>                     <dbl>           <dbl>           <dbl>
1  2019 Q1      United States             72450           45200             6.8
2  2019 Q1      Netherlands               18300            1200            10.2
3  2019 Q1      Venezuela                  8200             400             5.1
4  2019 Q1      Colombia                   6100             800             4.8
5  2019 Q1      Canada                     4500            3200             7.1
6  2019 Q1      Other                      9800            5600             5.5
# ℹ 2 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>

mutate() — Compute Variable


In SPSS: Transform > Compute Variable. You would type a target variable name, then an expression. In R, mutate() creates a new column (or modifies an existing one):

R

visitors <- mutate(visitors,
  total_visitors = visitors_stayover + visitors_cruise
)
head(select(visitors, year, quarter, origin, visitors_stayover, visitors_cruise, total_visitors))

OUTPUT

# A tibble: 6 × 6
   year quarter origin        visitors_stayover visitors_cruise total_visitors
  <dbl> <chr>   <chr>                     <dbl>           <dbl>          <dbl>
1  2019 Q1      United States             72450           45200         117650
2  2019 Q1      Netherlands               18300            1200          19500
3  2019 Q1      Venezuela                  8200             400           8600
4  2019 Q1      Colombia                   6100             800           6900
5  2019 Q1      Canada                     4500            3200           7700
6  2019 Q1      Other                      9800            5600          15400

You can create multiple columns at once:

R

visitors <- mutate(visitors,
  spending_per_night = avg_spending_usd / avg_stay_nights,
  stayover_pct = visitors_stayover / total_visitors * 100
)
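No output appears because the result is assigned back to visitors. You can inspect the new columns the same way as before, for example:

```r
# Peek at the two newly created columns alongside origin
head(select(visitors, origin, spending_per_night, stayover_pct))
```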

case_when() — Recode into Different Variables


In SPSS: Transform > Recode into Different Variables, where you map old values to new values. In R, you use case_when() inside mutate():

R

visitors <- mutate(visitors,
  region = case_when(
    origin == "United States" ~ "North America",
    origin == "Canada"        ~ "North America",
    origin == "Netherlands"   ~ "Europe",
    origin == "Venezuela"     ~ "South America",
    origin == "Colombia"      ~ "South America",
    .default                  = "Other"
  )
)

table(visitors$region)

OUTPUT


       Europe North America         Other South America
           20            40            20            40 

The syntax is: condition ~ value_to_assign. The .default line catches everything that did not match a previous condition — like the “Else” box in SPSS Recode.
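When a recode has only two outcomes, dplyr's if_else() is a lighter alternative to case_when(). A sketch (the pandemic_flag column is our own illustration, not part of the lesson dataset):

```r
# Two-way recode: condition, value if TRUE, value if FALSE
visitors <- mutate(visitors,
  pandemic_flag = if_else(year == 2020, "Pandemic year", "Other year")
)
table(visitors$pandemic_flag)
```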

arrange() — Sort Cases


In SPSS: Data > Sort Cases. In R:

R

# Sort by stayover visitors, ascending (default)
arrange(visitors, visitors_stayover)

OUTPUT

# A tibble: 120 × 13
    year quarter origin      visitors_stayover visitors_cruise avg_stay_nights
   <dbl> <chr>   <chr>                   <dbl>           <dbl>           <dbl>
 1  2020 Q2      Canada                    180               0             5.2
 2  2020 Q2      Venezuela                 200               0             3.5
 3  2020 Q2      Colombia                  300               0             3.8
 4  2020 Q2      Other                     400               0             4.1
 5  2020 Q3      Venezuela                 800               0             3.9
 6  2020 Q2      Netherlands              1100               0             8.5
 7  2020 Q3      Canada                   1100               0             5.8
 8  2020 Q3      Colombia                 1200               0             4
 9  2020 Q4      Venezuela                1800               0             4.1
10  2021 Q1      Venezuela                2100              50             4
# ℹ 110 more rows
# ℹ 7 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
#   satisfaction_score <dbl>, total_visitors <dbl>, spending_per_night <dbl>,
#   stayover_pct <dbl>, region <chr>

R

# Sort descending — use desc()
arrange(visitors, desc(visitors_stayover))

OUTPUT

# A tibble: 120 × 13
    year quarter origin        visitors_stayover visitors_cruise avg_stay_nights
   <dbl> <chr>   <chr>                     <dbl>           <dbl>           <dbl>
 1  2023 Q1      United States             78200           48900             7
 2  2023 Q4      United States             75600           46100             6.9
 3  2022 Q1      United States             74800           46800             6.9
 4  2019 Q1      United States             72450           45200             6.8
 5  2022 Q4      United States             72100           44500             6.8
 6  2023 Q2      United States             70100           41200             6.7
 7  2019 Q4      United States             69800           42800             6.7
 8  2020 Q1      United States             68100           40200             6.6
 9  2022 Q2      United States             67500           39200             6.6
10  2019 Q2      United States             65200           38100             6.5
# ℹ 110 more rows
# ℹ 7 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
#   satisfaction_score <dbl>, total_visitors <dbl>, spending_per_night <dbl>,
#   stayover_pct <dbl>, region <chr>

R

# Sort by multiple columns: year first, then by stayover visitors descending
arrange(visitors, year, desc(visitors_stayover))

OUTPUT

# A tibble: 120 × 13
    year quarter origin        visitors_stayover visitors_cruise avg_stay_nights
   <dbl> <chr>   <chr>                     <dbl>           <dbl>           <dbl>
 1  2019 Q1      United States             72450           45200             6.8
 2  2019 Q4      United States             69800           42800             6.7
 3  2019 Q2      United States             65200           38100             6.5
 4  2019 Q3      United States             58900           32400             6.2
 5  2019 Q3      Netherlands               21200             800            10.5
 6  2019 Q4      Netherlands               19600            1100            10.4
 7  2019 Q1      Netherlands               18300            1200            10.2
 8  2019 Q2      Netherlands               15400             900             9.8
 9  2019 Q1      Other                      9800            5600             5.5
10  2019 Q4      Other                      8900            5100             5.4
# ℹ 110 more rows
# ℹ 7 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
#   satisfaction_score <dbl>, total_visitors <dbl>, spending_per_night <dbl>,
#   stayover_pct <dbl>, region <chr>

group_by() + summarise() — Split File and Aggregate


This is one of the most powerful combinations in dplyr, and it replaces two SPSS operations at once:

  • Data > Split File (which tells SPSS to run analyses separately for each group)
  • Data > Aggregate (which calculates summary statistics by group)

R

# Average stayover visitors per year
yearly_summary <- visitors |>
  group_by(year) |>
  summarise(
    mean_stayover = mean(visitors_stayover),
    total_stayover = sum(visitors_stayover)
  )
yearly_summary

OUTPUT

# A tibble: 5 × 3
   year mean_stayover total_stayover
  <dbl>         <dbl>          <dbl>
1  2019        18360.         440650
2  2020         8687.         208480
3  2021        14883.         357200
4  2022        18488.         443700
5  2023        19321.         463700

Wait — what is that |> symbol? That is the pipe operator, and it deserves its own section.

The pipe operator |>


The pipe |> is one of the most important ideas in modern R. Read it as “and then”. It takes the result of the expression on the left and passes it as the first argument to the function on the right.

Without the pipe, you would write:

R

# Nested style (hard to read)
summarise(group_by(filter(visitors, origin == "United States"), year), mean_spend = mean(avg_spending_usd))

That is like reading a sentence from the inside out. With the pipe, the same code becomes:

R

visitors |>
  filter(origin == "United States") |>
  group_by(year) |>
  summarise(mean_spend = mean(avg_spending_usd))

OUTPUT

# A tibble: 5 × 2
   year mean_spend
  <dbl>      <dbl>
1  2019      1225
2  2020      1170
3  2021      1215
4  2022      1252.
5  2023      1275 

Read this as: “Take visitors, and then filter to US rows, and then group by year, and then summarise mean spending.”

The pipe makes your code read from top to bottom, like a recipe. Each line is one step.

Callout

|> vs %>%

You may see %>% in older R code and tutorials. This is the original pipe operator from the magrittr package. The native pipe |> was added to base R in version 4.1 (2021) and works without loading any packages. They behave almost identically. We use |> in this course because it requires no extra dependencies.
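You can verify the equivalence yourself: both pipelines below return the same count (this assumes the tidyverse is loaded, which provides %>%):

```r
a <- visitors |>  filter(origin == "Canada") |>  nrow()
b <- visitors %>% filter(origin == "Canada") %>% nrow()
a == b
```

For everything in this lesson, the two pipes are interchangeable.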

Building up a pipeline step by step

A good workflow is to build your pipeline one step at a time, checking the result after each line. Let us work through an example:

R

# Step 1: Start with the data
visitors |>
  filter(year >= 2022)

OUTPUT

# A tibble: 48 × 13
    year quarter origin        visitors_stayover visitors_cruise avg_stay_nights
   <dbl> <chr>   <chr>                     <dbl>           <dbl>           <dbl>
 1  2022 Q1      United States             74800           46800             6.9
 2  2022 Q1      Netherlands               19100            1300            10.3
 3  2022 Q1      Venezuela                  4800             200             4.4
 4  2022 Q1      Colombia                   5900             750             4.7
 5  2022 Q1      Canada                     4600            3300             7.2
 6  2022 Q1      Other                     10200            5800             5.6
 7  2022 Q2      United States             67500           39200             6.6
 8  2022 Q2      Netherlands               16200            1000            10
 9  2022 Q2      Venezuela                  4200             150             4.3
10  2022 Q2      Colombia                   5600             650             4.6
# ℹ 38 more rows
# ℹ 7 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
#   satisfaction_score <dbl>, total_visitors <dbl>, spending_per_night <dbl>,
#   stayover_pct <dbl>, region <chr>

R

# Step 2: Add a column selection
visitors |>
  filter(year >= 2022) |>
  select(year, quarter, origin, visitors_stayover, avg_spending_usd)

OUTPUT

# A tibble: 48 × 5
    year quarter origin        visitors_stayover avg_spending_usd
   <dbl> <chr>   <chr>                     <dbl>            <dbl>
 1  2022 Q1      United States             74800             1280
 2  2022 Q1      Netherlands               19100             1000
 3  2022 Q1      Venezuela                  4800              570
 4  2022 Q1      Colombia                   5900              700
 5  2022 Q1      Canada                     4600             1200
 6  2022 Q1      Other                     10200              900
 7  2022 Q2      United States             67500             1250
 8  2022 Q2      Netherlands               16200              980
 9  2022 Q2      Venezuela                  4200              550
10  2022 Q2      Colombia                   5600              690
# ℹ 38 more rows

R

# Step 3: Group and summarise
visitors |>
  filter(year >= 2022) |>
  group_by(origin) |>
  summarise(
    mean_stayover = mean(visitors_stayover),
    mean_spending = mean(avg_spending_usd)
  ) |>
  arrange(desc(mean_stayover))

OUTPUT

# A tibble: 6 × 3
  origin        mean_stayover mean_spending
  <chr>                 <dbl>         <dbl>
1 United States        70375          1264.
2 Netherlands          20038.         1011.
3 Other                 9275           888.
4 Colombia              5600           696.
5 Venezuela             4250           552.
6 Canada                3888.         1176.

This final pipeline reads: “Take visitors, keep only 2022 and later, group by origin, calculate mean stayover visitors and mean spending, then sort by mean stayover visitors in descending order.”

Instructor notes

  • Write the pipe |> on the whiteboard and say “and then” out loud every time you use it. This mental model sticks.
  • Build pipelines live, one step at a time. Run after each added line so participants can see how the output changes.
  • The most common beginner mistake is putting |> at the start of a line instead of at the end of the previous line. Emphasize: the pipe goes at the end of the line, so R knows the expression continues.
  • Compare nested function calls to piped code side by side. The readability advantage sells itself.
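The pipe-placement mistake looks like this in practice (the broken version is commented out so the chunk still runs):

```r
# WRONG --- R evaluates `visitors` on its own line, then errors at the dangling pipe:
# visitors
#   |> filter(origin == "United States")

# CORRECT --- the pipe ends the line, so R knows the expression continues
visitors |>
  filter(origin == "United States")
```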
Challenge

Challenge 1: US visitors analysis

Using the visitors dataset and the pipe operator, write a pipeline that:

  1. Filters to only United States visitors
  2. Creates a new column called total_visitors that adds visitors_stayover and visitors_cruise
  3. Groups by year
  4. Calculates the total (sum) of total_visitors for each year

Save the result to an object called us_yearly and print it. Which year had the fewest total US visitors? Why might that be?

R

us_yearly <- visitors |>
  filter(origin == "United States") |>
  mutate(total_visitors = visitors_stayover + visitors_cruise) |>
  group_by(year) |>
  summarise(total = sum(total_visitors))

us_yearly

OUTPUT

# A tibble: 5 × 2
   year  total
  <dbl>  <dbl>
1  2019 424850
2  2020 181300
3  2021 315300
4  2022 440700
5  2023 459800

2020 had by far the fewest US visitors, due to the COVID-19 pandemic and associated travel restrictions. You can see the recovery in subsequent years.

Challenge

Challenge 2: Top spending origins

Write a pipeline that finds the average avg_spending_usd for each origin country across the entire dataset, then sorts the result from highest to lowest. Which origin country has the highest average spending?

R

visitors |>
  group_by(origin) |>
  summarise(mean_spending = mean(avg_spending_usd)) |>
  arrange(desc(mean_spending))

OUTPUT

# A tibble: 6 × 2
  origin        mean_spending
  <chr>                 <dbl>
1 United States         1228.
2 Canada                1139
3 Netherlands            978.
4 Other                  850
5 Colombia               654
6 Venezuela              540 

The United States has the highest average spending per visitor, followed by Canada. This makes sense given the typical length of stay and the types of tourism activities associated with visitors from these markets.

Challenge

Challenge 3: Regional summary

Using the region column we created earlier with case_when(), write a pipeline that calculates the total visitors_stayover and mean satisfaction_score for each region and each year. Sort by year and then by total stayover visitors descending.

Hint: you will need to group_by() two columns.

R

visitors |>
  group_by(region, year) |>
  summarise(
    total_stayover = sum(visitors_stayover),
    mean_satisfaction = mean(satisfaction_score),
    .groups = "drop"
  ) |>
  arrange(year, desc(total_stayover))

OUTPUT

# A tibble: 20 × 4
   region         year total_stayover mean_satisfaction
   <chr>         <dbl>          <dbl>             <dbl>
 1 North America  2019         280950              8
 2 Europe         2019          74500              7.92
 3 South America  2019          50500              7.45
 4 Other          2019          34700              7.5
 5 North America  2020         141180              7.56
 6 Europe         2020          35500              7.48
 7 Other          2020          16600              7.2
 8 South America  2020          15200              6.98
 9 North America  2021         233500              7.88
10 Europe         2021          65600              7.82
11 South America  2021          29800              7.15
12 Other          2021          28300              7.4
13 North America  2022         290800              8.1
14 Europe         2022          78100              8.02
15 South America  2022          38500              7.34
16 Other          2022          36300              7.6
17 North America  2023         303300              8.19
18 Europe         2023          82200              8.12
19 South America  2023          40300              7.44
20 Other          2023          37900              7.68

The .groups = "drop" argument tells summarise() to remove the grouping after calculation. Without it, the result would still be grouped by region, which can cause unexpected behaviour in later steps.
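You can check the grouping state of each version with dplyr's group_vars(), which lists the active grouping columns. A sketch:

```r
still_grouped <- visitors |>
  group_by(region, year) |>
  summarise(n = n())                    # default: keeps the `region` grouping

fully_dropped <- visitors |>
  group_by(region, year) |>
  summarise(n = n(), .groups = "drop")  # no grouping left

group_vars(still_grouped)   # "region"
group_vars(fully_dropped)   # character(0)
```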

Summary


You now know the six core dplyr verbs and can map each one to its SPSS equivalent:

You used to… Now you write…
Data > Select Cases filter()
Select columns in Variable View select()
Transform > Compute Variable mutate()
Transform > Recode mutate() + case_when()
Data > Sort Cases arrange()
Data > Split File + Aggregate group_by() + summarise()

And you connect them all with |> — “and then” — to build readable, reproducible data pipelines.

Key Points
  • dplyr verbs (filter, select, mutate, arrange, summarise) replace SPSS menu operations
  • The pipe operator |> chains operations together, making code readable
  • group_by() combined with summarise() replaces SPSS Split File + Aggregate

Content from Visualization with ggplot2


Last updated on 2026-04-22 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How do I create charts in R that look better than SPSS Chart Builder output?
  • What is the ggplot2 “grammar of graphics” approach?
  • How do I customize colors, labels, and themes?

Objectives

  • Build bar charts, histograms, scatterplots, and line charts with ggplot2
  • Customize plots with labels, colors, and themes for publication quality
  • Compare ggplot2 output with SPSS Chart Builder equivalents
  • Create faceted plots to compare groups
Cartoon of a researcher painting a tropical sunset onto a ggplot2 canvas with flamingos and plot axes
SPSS gives you a chart. ggplot2 gives you a language.

The grammar of graphics


In SPSS, you create charts through the Chart Builder dialog: you drag variables onto axes, pick a chart type, and click OK. The result is a finished chart, but customizing it requires clicking through many menus.

ggplot2 takes a fundamentally different approach called the grammar of graphics. Instead of picking a finished chart type, you build a plot layer by layer, like constructing a sentence:

  1. Data — what data frame are you plotting?
  2. Aesthetics (aes()) — which variables map to the x-axis, y-axis, color, size, etc.?
  3. Geometry (geom_*()) — what visual marks represent the data (bars, points, lines)?
  4. Labels (labs()) — what titles and axis labels should appear?
  5. Theme (theme_*()) — what overall style should the plot have?

You combine these layers with the + operator. Let’s see how this works in practice. First, let’s load our packages and data:

R

library(tidyverse)

visitors <- read_csv("data/aruba_visitors.csv")

Here is the simplest possible ggplot call — just the data and aesthetics, with no geometry yet:

R

ggplot(data = visitors, aes(x = origin, y = visitors_stayover))

This gives us an empty canvas with axes. Now we add a geometry layer:

R

ggplot(data = visitors, aes(x = origin, y = visitors_stayover)) +
  geom_col()

That is already a bar chart. The + operator is how you add layers — think of it as stacking transparencies on top of each other.

Callout

The + operator vs. the pipe |>

The pipe |> passes data into a function. The + in ggplot2 adds layers to a plot. They look similar but do different things. A common beginner mistake is using |> where + is needed:

R

# WRONG --- this will produce an error
ggplot(visitors, aes(x = origin)) |> geom_bar()

# CORRECT
ggplot(visitors, aes(x = origin)) + geom_bar()

Common chart types


Below is a reference table mapping SPSS Chart Builder chart types to their ggplot2 equivalents:

Chart type SPSS menu path ggplot2 geometry
Bar chart Graphs > Chart Builder > Bar geom_bar() / geom_col()
Histogram Graphs > Chart Builder > Histogram geom_histogram()
Scatterplot Graphs > Chart Builder > Scatter/Dot geom_point()
Line chart Graphs > Chart Builder > Line geom_line()
Boxplot Graphs > Chart Builder > Boxplot geom_boxplot()

Let’s work through each one using the Aruba visitors data.

Bar chart: geom_bar() and geom_col()

There are two bar chart geoms. Use geom_bar() when you want R to count rows for you, and geom_col() when you already have the values to plot.

R

# geom_bar() counts the rows per origin (each origin has 20 rows: 4 quarters x 5 years)
ggplot(visitors, aes(x = origin)) +
  geom_bar()

R

# geom_col() uses a pre-computed value on the y-axis
visitors_total <- visitors |>
  group_by(origin) |>
  summarise(total_stayover = sum(visitors_stayover))

ggplot(visitors_total, aes(x = reorder(origin, -total_stayover),
                            y = total_stayover)) +
  geom_col()
Callout

geom_bar() vs. geom_col() — when to use which?

  • geom_bar() uses stat = "count" by default: it counts how many rows fall into each category. You only need an x aesthetic.
  • geom_col() uses stat = "identity": it plots the actual value you supply. You need both x and y.

In SPSS Chart Builder, when you drag a categorical variable to the x-axis and a scale variable to the y-axis with “Mean” as the summary, that is equivalent to first computing the mean with summarise() and then using geom_col().
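In code, that two-step equivalent looks like this (mean_spending_by_origin is a name we choose for the sketch):

```r
# Step 1: compute the mean per group, as SPSS's "Mean" summary would
mean_spending_by_origin <- visitors |>
  group_by(origin) |>
  summarise(mean_spending = mean(avg_spending_usd))

# Step 2: plot the pre-computed values with geom_col()
ggplot(mean_spending_by_origin, aes(x = origin, y = mean_spending)) +
  geom_col()
```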

Histogram: geom_histogram()

In SPSS: Graphs > Chart Builder, drag a scale variable to the x-axis and select the Histogram type.

R

ggplot(visitors, aes(x = avg_spending_usd)) +
  geom_histogram(binwidth = 50, color = "white")

The binwidth argument controls how wide each bin is. Experiment with different values to see how the shape of the distribution changes.

Scatterplot: geom_point()

In SPSS: Graphs > Chart Builder, drag variables to x and y axes and select Simple Scatter.

R

ggplot(visitors, aes(x = avg_spending_usd, y = satisfaction_score)) +
  geom_point()

You can map additional variables to aesthetics like color and size:

R

ggplot(visitors, aes(x = avg_spending_usd, y = satisfaction_score,
                     color = origin)) +
  geom_point(size = 2, alpha = 0.7)

Line chart: geom_line()

Line charts are great for showing trends over time. Let’s compute quarterly totals and plot them:

R

quarterly_totals <- visitors |>
  group_by(year, quarter) |>
  summarise(total_stayover = sum(visitors_stayover), .groups = "drop") |>
  mutate(date_label = paste(year, quarter, sep = "-"))

ggplot(quarterly_totals, aes(x = date_label, y = total_stayover, group = 1)) +
  geom_line() +
  geom_point()

That works, but the x-axis labels overlap. Let’s fix that with rotated labels and plot one line per origin:

R

visitors_by_qtr <- visitors |>
  mutate(date_label = paste(year, quarter, sep = "-"))

ggplot(visitors_by_qtr, aes(x = date_label, y = visitors_stayover,
                             color = origin, group = origin)) +
  geom_line() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Boxplot: geom_boxplot()

In SPSS: Graphs > Chart Builder, select Boxplot and drag a grouping variable to the x-axis and a scale variable to the y-axis.

R

ggplot(visitors, aes(x = origin, y = avg_spending_usd)) +
  geom_boxplot()

Making it publication-ready


So far our plots have been functional but plain. Let’s take a bar chart through the full journey from basic to polished. This is where ggplot2 truly outshines SPSS Chart Builder — every tweak is a single line of code.

Step 1: Basic chart

R

p <- ggplot(visitors_total, aes(x = reorder(origin, -total_stayover),
                                 y = total_stayover)) +
  geom_col()
p

Step 2: Add labels

R

p <- p +
  labs(
    title = "Stayover Visitors to Aruba by Origin Country",
    subtitle = "Total visitors, 2019--2023",
    x = "Country of Origin",
    y = "Total Stayover Visitors",
    caption = "Source: Aruba visitors dataset"
  )
p

Step 3: Apply a clean theme

R

p <- p + theme_minimal()
p

Step 4: Customize colors

R

p <- p +
  geom_col(fill = "#2a9d8f") +
  theme_minimal(base_size = 13)
p

Step 5: Fine-tune text and formatting

R

p <- p +
  scale_y_continuous(labels = scales::comma) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 30, hjust = 1)
  )
p
Callout

Saving your plot

Use ggsave() to export your plot as a PNG, PDF, or SVG file:

R

ggsave("my_plot.png", plot = p, width = 8, height = 5, dpi = 300)

In SPSS you right-click the chart and choose Export — ggsave() gives you precise control over dimensions and resolution, which is exactly what journals require.

Faceting: small multiples


Faceting is one of ggplot2’s most powerful features and something SPSS Chart Builder handles poorly. Instead of cramming all groups onto one chart, you split the data into panels — one per group.

R

visitors_annual <- visitors |>
  group_by(year, origin) |>
  summarise(total_stayover = sum(visitors_stayover), .groups = "drop")

ggplot(visitors_annual, aes(x = year, y = total_stayover)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ origin, scales = "free_y") +
  labs(
    title = "Annual Stayover Visitors by Origin",
    x = "Year",
    y = "Total Stayover Visitors"
  ) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal()

The scales = "free_y" argument lets each panel have its own y-axis range. This is important when groups have very different magnitudes (e.g., US visitors vastly outnumber Canadian visitors).

The faceting example is a great place to pause and let learners experiment. Encourage them to try:

  • facet_wrap(~ origin, ncol = 2) to control the layout
  • facet_grid(origin ~ .) for a grid arrangement
  • removing scales = "free_y" to see the difference
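A minimal sketch of those variations on made-up data (the group names and numbers are hypothetical):

```r
library(ggplot2)

# Two origins over three years, invented numbers with very different scales
mini <- data.frame(
  year   = rep(2019:2021, times = 2),
  total  = c(10, 12, 14, 100, 120, 140),
  origin = rep(c("A", "B"), each = 3)
)

base <- ggplot(mini, aes(x = year, y = total)) + geom_line()

# ncol = 2 forces two panels per row
p_wrap <- base + facet_wrap(~ origin, ncol = 2)

# facet_grid(origin ~ .) stacks the panels vertically instead
p_grid <- base + facet_grid(origin ~ .)

# without scales = "free_y", group A's line looks flat next to B's
p_shared <- base + facet_wrap(~ origin)
```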
Challenge

Challenge 1: Build a publication-quality faceted chart

Using the visitors data, create a scatterplot of avg_spending_usd (x-axis) vs. satisfaction_score (y-axis), with the following requirements:

  1. Color the points by origin
  2. Add a linear trend line using geom_smooth(method = "lm")
  3. Facet by year
  4. Add a proper title, axis labels, and caption
  5. Use theme_minimal() and make the title bold

R

ggplot(visitors, aes(x = avg_spending_usd, y = satisfaction_score,
                     color = origin)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.8) +
  facet_wrap(~ year) +
  labs(
    title = "Spending vs. Satisfaction by Origin Country",
    subtitle = "Each panel represents one year (2019--2023)",
    x = "Average Spending (USD)",
    y = "Satisfaction Score (1--10)",
    color = "Origin",
    caption = "Source: Aruba visitors dataset"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "bottom"
  )

OUTPUT

`geom_smooth()` using formula = 'y ~ x'
Challenge

Challenge 2: Recreate an SPSS-style chart

In SPSS, a common chart is a clustered bar chart showing means by group. Create the R equivalent: a bar chart showing mean stay-over visitors per quarter for each origin country, with bars colored by origin. Add error bars using stat_summary().

Hint: You can use stat_summary(fun = mean, geom = "col") to compute the mean within the plot itself, without pre-computing it.

R

ggplot(visitors, aes(x = origin, y = visitors_stayover, fill = origin)) +
  stat_summary(fun = mean, geom = "col", width = 0.7) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.3) +
  labs(
    title = "Mean Quarterly Stayover Visitors by Origin",
    x = "Country of Origin",
    y = "Mean Stayover Visitors per Quarter"
  ) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 30, hjust = 1),
    legend.position = "none"
  )
Key Points
  • ggplot2 builds plots in layers: data, aesthetics, geometry, labels, theme
  • Every SPSS Chart Builder chart has a ggplot2 equivalent that offers more control
  • Faceting (facet_wrap) lets you create small multiples — something SPSS Chart Builder handles poorly

Content from Statistical Analysis in R


Last updated on 2026-04-22 | Edit this page

Estimated time: 120 minutes

Overview

Questions

  • How do I run t-tests, ANOVA, correlations, and regression in R?
  • How does R output compare to SPSS output tables?
  • How do I extract and report results?

Objectives

  • Run independent and paired samples t-tests
  • Conduct one-way ANOVA with post-hoc tests
  • Calculate correlations and run simple linear regression
  • Interpret R output by mapping it to familiar SPSS output tables
Cartoon of a researcher as a beach detective following a regression line in the sand while crabs carry p-values on their shells
Same tests, fewer menus, more crabs.

From SPSS dialogs to R functions


In SPSS, every statistical test lives behind a menu: Analyze > Compare Means, Analyze > Correlate, and so on. In R, each test is a single function call. The table below maps the SPSS dialogs you already know to their R equivalents:

Analysis                   | SPSS menu path                                      | R function
---------------------------|-----------------------------------------------------|---------------------------------------
Independent-samples t-test | Analyze > Compare Means > Independent-Samples T Test | t.test(y ~ group, data = df)
Paired-samples t-test      | Analyze > Compare Means > Paired-Samples T Test     | t.test(x, y, paired = TRUE)
One-way ANOVA              | Analyze > Compare Means > One-Way ANOVA             | aov(y ~ group, data = df) + summary()
Post-hoc tests             | (checkbox in ANOVA dialog)                          | TukeyHSD()
Bivariate correlation      | Analyze > Correlate > Bivariate                     | cor.test(df$x, df$y)
Linear regression          | Analyze > Regression > Linear                       | lm(y ~ x1 + x2, data = df) + summary()

Let’s load our data and packages:

R

library(tidyverse)
library(broom)

visitors <- read_csv("data/aruba_visitors.csv")

Checking normality

Before running parametric tests like the t-test or ANOVA, you need to assess whether your data are approximately normally distributed. In SPSS, you would go to Analyze > Descriptive Statistics > Explore, move your variable to the “Dependent List”, click “Plots”, and check the “Normality plots with tests” box. R gives you the same tools – and more control over how they look.

Visual inspection: histogram and Q-Q plot

You already know ggplot2 from Episode 04, so let’s use it. A histogram shows the shape of the distribution, and a Q-Q (quantile-quantile) plot compares your data to a theoretical normal distribution. If the points fall roughly along the diagonal line, the data are approximately normal.

R

# Histogram of average spending
ggplot(visitors, aes(x = avg_spending_usd)) +
  geom_histogram(bins = 20, fill = "#2a9d8f", color = "white", alpha = 0.8) +
  labs(title = "Distribution of average spending",
       x = "Average Spending (USD)",
       y = "Count") +
  theme_minimal()

R

# Q-Q plot of average spending
ggplot(visitors, aes(sample = avg_spending_usd)) +
  stat_qq() +
  stat_qq_line(color = "#e76f51", linewidth = 0.8) +
  labs(title = "Q-Q plot of average spending",
       x = "Theoretical quantiles",
       y = "Sample quantiles") +
  theme_minimal()

Points that hug the diagonal line suggest normality. Systematic deviations – S-curves, banana shapes, or points flying off at the tails – indicate non-normality.

Shapiro-Wilk test

The Shapiro-Wilk test is the formal statistical test for normality. The null hypothesis is that the data are normally distributed, so a significant p-value (< 0.05) means you reject normality.

R

shapiro.test(visitors$avg_spending_usd)

OUTPUT


	Shapiro-Wilk normality test

data:  visitors$avg_spending_usd
W = 0.93876, p-value = 3.534e-05

In SPSS, this same test appears in the “Tests of Normality” table when you check the normality plots option in Explore.

When does normality actually matter?

Parametric tests like the t-test and ANOVA assume normality, but they are surprisingly robust to violations of this assumption – especially with larger samples. As a rough guide:

  • With n > 30 per group, the Central Limit Theorem means the sampling distribution of the mean is approximately normal regardless of the data distribution. The t-test and ANOVA will perform well.
  • With small samples (n < 15 per group), normality matters more. Consider non-parametric alternatives (Wilcoxon test, Kruskal-Wallis) if the data are clearly non-normal.
  • Always look at the plots first. The Shapiro-Wilk test is sensitive to trivial deviations in large samples, so a significant p-value does not necessarily mean parametric tests are inappropriate.

The combination of visual inspection and the formal test gives you a complete picture – far more useful than relying on either one alone.
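You can see that large-sample sensitivity with simulated data (a sketch with invented numbers, not the visitors data). A Gamma distribution with shape = 50 is only mildly skewed and looks almost normal, yet at n = 5000 Shapiro-Wilk typically rejects it while the same shape at n = 30 typically passes:

```r
set.seed(123)

# A mildly skewed distribution (skewness about 0.28) that looks almost normal
x_large <- rgamma(5000, shape = 50)
x_small <- x_large[1:30]

# With n = 5000, the trivial skew is flagged as "significant"
shapiro.test(x_large)$p.value

# The same shape at n = 30 typically passes
shapiro.test(x_small)$p.value
```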

Challenge

Challenge: Check normality for satisfaction scores

  1. Create a histogram of satisfaction_score using ggplot2.
  2. Create a Q-Q plot of satisfaction_score.
  3. Run the Shapiro-Wilk test on satisfaction_score.
  4. Based on all three, would you feel comfortable running a parametric test on this variable? Why or why not?

R

# 1: Histogram
ggplot(visitors, aes(x = satisfaction_score)) +
  geom_histogram(bins = 15, fill = "#2a9d8f", color = "white", alpha = 0.8) +
  labs(title = "Distribution of satisfaction scores",
       x = "Satisfaction Score",
       y = "Count") +
  theme_minimal()

R

# 2: Q-Q plot
ggplot(visitors, aes(sample = satisfaction_score)) +
  stat_qq() +
  stat_qq_line(color = "#e76f51", linewidth = 0.8) +
  labs(title = "Q-Q plot of satisfaction scores",
       x = "Theoretical quantiles",
       y = "Sample quantiles") +
  theme_minimal()

R

# 3: Shapiro-Wilk test
shapiro.test(visitors$satisfaction_score)

OUTPUT


	Shapiro-Wilk normality test

data:  visitors$satisfaction_score
W = 0.97421, p-value = 0.02086

R

# 4: Interpretation depends on the output. With n = 120, the t-test is
# robust to moderate non-normality. Check whether the histogram looks
# roughly symmetric and the Q-Q points stay close to the line.

T-tests

Independent-samples t-test

In SPSS, you would go to Analyze > Compare Means > Independent-Samples T Test, move your test variable to the “Test Variable(s)” box, move your grouping variable to the “Grouping Variable” box, and define the two groups.

In R, it is one line. Let’s test whether the average spending differs between US and Netherlands visitors:

R

# Filter to just the two groups we want to compare
us_nl <- visitors |>
  filter(origin %in% c("United States", "Netherlands"))

# Run the independent-samples t-test
t_result <- t.test(avg_spending_usd ~ origin, data = us_nl)
t_result

OUTPUT


	Welch Two Sample t-test

data:  avg_spending_usd by origin
t = -16.195, df = 37.99, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Netherlands and group United States is not equal to 0
95 percent confidence interval:
 -280.1256 -217.8744
sample estimates:
  mean in group Netherlands mean in group United States
                      978.5                      1227.5 
Callout

Reading the t-test output — SPSS comparison

The R output gives you the same information as the SPSS “Independent Samples Test” table, just arranged differently:

SPSS output column        | R output line
--------------------------|-----------------------------------------
t                         | t = ...
df                        | df = ...
Sig. (2-tailed)           | p-value = ...
Mean Difference           | difference in means shown in estimates
95% CI of the Difference  | 95 percent confidence interval:

The key difference: SPSS shows Levene’s test for equality of variances automatically. R’s t.test() uses the Welch correction by default (which does not assume equal variances). This is actually the better default — many statisticians recommend always using the Welch t-test.

If you need the equal-variances version (the “Equal variances assumed” row in SPSS), add var.equal = TRUE:

R

t.test(avg_spending_usd ~ origin, data = us_nl, var.equal = TRUE)

Paired-samples t-test

A paired t-test compares two measurements on the same cases. Let’s compare Q1 vs. Q3 hotel occupancy rates (the same origin-year combinations measured in two different quarters):

R

# Get Q1 and Q3 data for each year
q1_data <- visitors |>
  filter(quarter == "Q1") |>
  arrange(year, origin) |>
  pull(hotel_occupancy_pct)

q3_data <- visitors |>
  filter(quarter == "Q3") |>
  arrange(year, origin) |>
  pull(hotel_occupancy_pct)

# Paired t-test
t.test(q1_data, q3_data, paired = TRUE)

OUTPUT


	Paired t-test

data:  q1_data and q3_data
t = 3.6602, df = 29, p-value = 0.000998
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
  4.941648 17.458352
sample estimates:
mean difference
           11.2 

In SPSS this would be Analyze > Compare Means > Paired-Samples T Test, where you select the two variables as a pair.

ANOVA

One-way ANOVA

In SPSS: Analyze > Compare Means > One-Way ANOVA. Move the dependent variable to the “Dependent List” and the factor to “Factor”.

Let’s test whether satisfaction scores differ across origin countries:

R

# Fit the ANOVA model
anova_model <- aov(satisfaction_score ~ origin, data = visitors)

# View the ANOVA table
summary(anova_model)

OUTPUT

             Df Sum Sq Mean Sq F value Pr(>F)
origin        5 11.610  2.3220   35.28 <2e-16 ***
Residuals   114  7.502  0.0658
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Callout

Reading the ANOVA table — SPSS comparison

SPSS output column | R output column
-------------------|----------------
Sum of Squares     | Sum Sq
df                 | Df
Mean Square        | Mean Sq
F                  | F value
Sig.               | Pr(>F)

The layout is almost identical — R just uses slightly different column names.

Post-hoc tests with TukeyHSD

If the ANOVA is significant, you want to know which groups differ. In SPSS, you check the “Post Hoc” button in the ANOVA dialog and select Tukey. In R:

R

TukeyHSD(anova_model)

OUTPUT

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = satisfaction_score ~ origin, data = visitors)

$origin
                            diff        lwr         upr     p adj
Colombia-Canada           -0.495 -0.7301532 -0.25984681 0.0000002
Netherlands-Canada        -0.020 -0.2551532  0.21515319 0.9998728
Other-Canada              -0.420 -0.6551532 -0.18484681 0.0000143
United States-Canada       0.100 -0.1351532  0.33515319 0.8199046
Venezuela-Canada          -0.755 -0.9901532 -0.51984681 0.0000000
Netherlands-Colombia       0.475  0.2398468  0.71015319 0.0000007
Other-Colombia             0.075 -0.1601532  0.31015319 0.9394347
United States-Colombia     0.595  0.3598468  0.83015319 0.0000000
Venezuela-Colombia        -0.260 -0.4951532 -0.02484681 0.0211495
Other-Netherlands         -0.400 -0.6351532 -0.16484681 0.0000408
United States-Netherlands  0.120 -0.1151532  0.35515319 0.6781085
Venezuela-Netherlands     -0.735 -0.9701532 -0.49984681 0.0000000
United States-Other        0.520  0.2848468  0.75515319 0.0000001
Venezuela-Other           -0.335 -0.5701532 -0.09984681 0.0009635
Venezuela-United States   -0.855 -1.0901532 -0.61984681 0.0000000

This gives you pairwise comparisons with adjusted p-values, the difference in means, and 95% confidence intervals — the same information as the SPSS “Multiple Comparisons” table.

You can also visualize the post-hoc results:

R

plot(TukeyHSD(anova_model), las = 1, cex.axis = 0.7)

Correlation

In SPSS: Analyze > Correlate > Bivariate. Move variables to the “Variables” box and select Pearson, Spearman, or both.

Let’s test the correlation between average spending and satisfaction:

R

cor.test(visitors$avg_spending_usd, visitors$satisfaction_score)

OUTPUT


	Pearson's product-moment correlation

data:  visitors$avg_spending_usd and visitors$satisfaction_score
t = 18.69, df = 118, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8110127 0.9037614
sample estimates:
      cor
0.8645736 
Callout

Reading correlation output — SPSS comparison

SPSS output         | R output
--------------------|-------------------------------------
Pearson Correlation | cor (the estimate at the end)
Sig. (2-tailed)     | p-value
N                   | shown in the data, not in output
95% CI              | 95 percent confidence interval

One advantage of R: cor.test() gives you a confidence interval for the correlation by default. SPSS does not show this unless you use syntax.

For a correlation matrix of multiple variables (like the SPSS correlation table), use cor():

R

visitors |>
  select(visitors_stayover, avg_stay_nights, avg_spending_usd,
         hotel_occupancy_pct, satisfaction_score) |>
  cor(use = "complete.obs") |>
  round(3)

OUTPUT

                    visitors_stayover avg_stay_nights avg_spending_usd
visitors_stayover               1.000           0.276            0.622
avg_stay_nights                 0.276           1.000            0.597
avg_spending_usd                0.622           0.597            1.000
hotel_occupancy_pct             0.231           0.166            0.206
satisfaction_score              0.589           0.674            0.865
                    hotel_occupancy_pct satisfaction_score
visitors_stayover                 0.231              0.589
avg_stay_nights                   0.166              0.674
avg_spending_usd                  0.206              0.865
hotel_occupancy_pct               1.000              0.575
satisfaction_score                0.575              1.000

Chi-square test of independence

All of the tests above work with numeric variables. When both variables are categorical, you need a different tool: the chi-square test of independence. In SPSS: Analyze > Descriptive Statistics > Crosstabs, move one variable to “Row(s)” and another to “Column(s)”, then click “Statistics” and check the “Chi-square” box.

Building a contingency table

First, create a contingency table using table(). Let’s test whether the distribution of visitor origin countries differs across quarters:

R

# Create the contingency table
origin_quarter <- table(visitors$origin, visitors$quarter)
origin_quarter

OUTPUT


                Q1 Q2 Q3 Q4
  Canada         5  5  5  5
  Colombia       5  5  5  5
  Netherlands    5  5  5  5
  Other          5  5  5  5
  United States  5  5  5  5
  Venezuela      5  5  5  5

This is the same table you would see in the “Crosstabulation” output in SPSS. Each cell shows the observed count for that combination of origin and quarter.

Running the chi-square test

R

chi_result <- chisq.test(origin_quarter)
chi_result

OUTPUT


	Pearson's Chi-squared test

data:  origin_quarter
X-squared = 0, df = 15, p-value = 1

The output gives you three pieces of information, matching the SPSS “Chi-Square Tests” table:

SPSS output column        | R output
--------------------------|-----------
Pearson Chi-Square value  | X-squared
df                        | df
Asymp. Sig. (2-sided)     | p-value

A significant p-value means the distribution of one variable depends on the other – the row and column variables are not independent.

Expected vs observed frequencies

The expected frequencies tell you what the cell counts would look like if the two variables were completely independent. Comparing expected to observed is how you identify which cells are driving the result.

R

# Expected frequencies (what you'd see under independence)
chi_result$expected |> round(1)

OUTPUT


                Q1 Q2 Q3 Q4
  Canada         5  5  5  5
  Colombia       5  5  5  5
  Netherlands    5  5  5  5
  Other          5  5  5  5
  United States  5  5  5  5
  Venezuela      5  5  5  5

R

# Standardized residuals -- large absolute values (> 2) flag notable cells
chi_result$residuals |> round(2)

OUTPUT


                Q1 Q2 Q3 Q4
  Canada         0  0  0  0
  Colombia       0  0  0  0
  Netherlands    0  0  0  0
  Other          0  0  0  0
  United States  0  0  0  0
  Venezuela      0  0  0  0

In SPSS, you would get these by checking “Expected” and “Standardized” under the “Cells” button in the Crosstabs dialog. Residuals greater than 2 or less than -2 point to cells where observed counts deviate substantially from what independence predicts.
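Both quantities are simple enough to check by hand: the expected count is row total * column total / grand total, and the standardized residual is (observed - expected) / sqrt(expected). A sketch on a made-up 2x2 table (the labels and counts are invented):

```r
# Made-up 2x2 table: ticket class vs. whether an upgrade was bought
obs <- matrix(c(30, 10,
                20, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Economy", "Premium"), c("No", "Yes")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Standardized residuals, exactly as chisq.test() defines them
resid_by_hand <- (obs - expected) / sqrt(expected)

res <- chisq.test(obs, correct = FALSE)
all.equal(unname(res$expected), unname(expected))        # TRUE
all.equal(unname(res$residuals), unname(resid_by_hand))  # TRUE
```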

Challenge

Challenge: Chi-square analysis

Create a new categorical variable called spending_level that classifies rows as “High” (above the median of avg_spending_usd) or “Low” (at or below the median). Then test whether spending_level is independent of origin.

  1. Create spending_level using mutate() and if_else().
  2. Build a contingency table of origin by spending_level.
  3. Run chisq.test() and interpret the result.
  4. Examine the standardized residuals to see which origin countries spend more (or less) than expected.

R

# 1: Create spending_level
visitors_spending <- visitors |>
  mutate(spending_level = if_else(
    avg_spending_usd > median(avg_spending_usd), "High", "Low"
  ))

# 2: Contingency table
spending_table <- table(visitors_spending$origin,
                        visitors_spending$spending_level)
spending_table

OUTPUT


                High Low
  Canada          20   0
  Colombia         0  20
  Netherlands     18   2
  Other            1  19
  United States   20   0
  Venezuela        0  20

R

# 3: Chi-square test
chi_spending <- chisq.test(spending_table)
chi_spending

OUTPUT


	Pearson's Chi-squared test

data:  spending_table
X-squared = 109, df = 5, p-value < 2.2e-16

R

# 4: Standardized residuals
chi_spending$residuals |> round(2)

OUTPUT


                 High   Low
  Canada         3.24 -3.19
  Colombia      -3.14  3.08
  Netherlands    2.60 -2.56
  Other         -2.82  2.77
  United States  3.24 -3.19
  Venezuela     -3.14  3.08

R

# Positive residuals in the "High" column mean that origin country has
# more high-spending rows than expected under independence.

Linear regression

In SPSS: Analyze > Regression > Linear. Move the dependent variable to “Dependent” and independent variables to “Independent(s)”.

Let’s predict average spending from stay nights and origin country:

R

reg_model <- lm(avg_spending_usd ~ avg_stay_nights + origin, data = visitors)
summary(reg_model)

OUTPUT


Call:
lm(formula = avg_spending_usd ~ avg_stay_nights + origin, data = visitors)

Residuals:
     Min       1Q   Median       3Q      Max
-112.239   -7.704    2.216   12.417   58.083

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          234.210     33.037   7.089 1.24e-10 ***
avg_stay_nights      134.943      4.879  27.659  < 2e-16 ***
originColombia      -184.753     12.670 -14.582  < 2e-16 ***
originNetherlands   -619.305     17.829 -34.737  < 2e-16 ***
originOther          -88.610      9.756  -9.082 4.05e-15 ***
originUnited States  114.139      6.600  17.294  < 2e-16 ***
originVenezuela     -273.788     13.452 -20.354  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.66 on 113 degrees of freedom
Multiple R-squared:  0.9937,	Adjusted R-squared:  0.9933
F-statistic:  2962 on 6 and 113 DF,  p-value: < 2.2e-16
Callout

Reading regression output — SPSS comparison

The summary() output contains everything from the SPSS regression output tables, but in a more compact format:

SPSS table            | R output section
----------------------|----------------------------------------------------
Model Summary (R-sq)  | Multiple R-squared, Adjusted R-squared at the bottom
ANOVA table (F-test)  | F-statistic at the very bottom
Coefficients table    | The Coefficients: section
B (unstandardized)    | Estimate column
Std. Error            | Std. Error column
t                     | t value column
Sig.                  | Pr(>|t|) column

Note: R does not give you standardized coefficients (Beta) by default. To get those, scale your variables first with scale(), or use the lm.beta package.
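A sketch of the scale() route, using R’s built-in mtcars data as a stand-in (the variable names come from that dataset, not the visitors sheet):

```r
# z-score the outcome and the predictors, then fit as usual
fit_std <- lm(scale(mpg) ~ scale(wt) + scale(hp), data = mtcars)

# The slopes are the SPSS "Beta" column; the intercept is numerically zero
# because everything has been centered
coef(fit_std)
```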

Reading R output with broom::tidy()


The raw R output is fine for interactive exploration, but it is hard to export or combine with other results. The broom package converts statistical output into tidy data frames — one row per term, columns for estimate, standard error, test statistic, and p-value.

R

# Tidy the t-test result
tidy(t_result)

OUTPUT

# A tibble: 1 × 10
  estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high
     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
1     -249      978.     1228.     -16.2 1.21e-18      38.0    -280.     -218.
# ℹ 2 more variables: method <chr>, alternative <chr>

R

# Tidy the regression coefficients
tidy(reg_model)

OUTPUT

# A tibble: 7 × 5
  term                estimate std.error statistic  p.value
  <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)            234.      33.0       7.09 1.24e-10
2 avg_stay_nights        135.       4.88     27.7  3.93e-52
3 originColombia        -185.      12.7     -14.6  9.90e-28
4 originNetherlands     -619.      17.8     -34.7  3.86e-62
5 originOther            -88.6      9.76     -9.08 4.05e-15
6 originUnited States    114.       6.60     17.3  1.56e-33
7 originVenezuela       -274.      13.5     -20.4  1.35e-39

R

# Get model-level statistics (R-squared, F, etc.)
glance(reg_model)

OUTPUT

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.994         0.993  20.7     2962. 8.91e-122     6  -530. 1076. 1098.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The tidy() output is a regular data frame, which means you can filter it, arrange it, or export it to CSV — something that is surprisingly difficult with SPSS output. This is extremely useful for reporting: you can extract exactly the numbers you need and drop them into a table or a plot.

This is a good time to reinforce the reproducibility advantage. In SPSS, if a reviewer asks you to re-run an analysis with a different subset, you have to click through the dialogs again. In R, you change one line of code and re-run the script.

Emphasize that broom::tidy() produces a data frame — this means students can use all the dplyr verbs they learned in Episode 3 on their statistical results (filtering significant results, arranging by p-value, etc.).
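Even without broom, base R exposes the coefficients table as a plain matrix you can filter the same way. A sketch on the built-in mtcars data (a stand-in for the visitors sheet):

```r
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# coef(summary(...)) is a matrix with columns Estimate, Std. Error,
# t value, and Pr(>|t|): one row per term, much like tidy()
coefs <- coef(summary(fit))

# Keep only the terms significant at the 5% level
coefs[coefs[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
```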

A complete analysis workflow


Let’s put everything together in a workflow that mirrors what you would do in SPSS, but entirely in code. Our research question: Does average spending differ by origin country, and what predicts higher spending?

R

# Step 1: Descriptive statistics by group
visitors |>
  group_by(origin) |>
  summarise(
    n = n(),
    mean_spending = mean(avg_spending_usd),
    sd_spending = sd(avg_spending_usd),
    mean_satisfaction = mean(satisfaction_score)
  ) |>
  arrange(desc(mean_spending))

OUTPUT

# A tibble: 6 × 5
  origin            n mean_spending sd_spending mean_satisfaction
  <chr>         <int>         <dbl>       <dbl>             <dbl>
1 United States    20         1228.        48.2              8.00
2 Canada           20         1139         63.0              7.90
3 Netherlands      20          978.        49.0              7.88
4 Other            20          850         64.2              7.48
5 Colombia         20          654         68.6              7.4
6 Venezuela        20          540         46.9              7.14

R

# Step 2: Visualize the distribution
ggplot(visitors, aes(x = reorder(origin, avg_spending_usd, FUN = median),
                     y = avg_spending_usd)) +
  geom_boxplot(fill = "#2a9d8f", alpha = 0.6) +
  labs(title = "Average Spending by Origin Country",
       x = "Country of Origin",
       y = "Average Spending (USD)") +
  theme_minimal()

R

# Step 3: ANOVA --- is there a significant difference?
spending_anova <- aov(avg_spending_usd ~ origin, data = visitors)
summary(spending_anova)

OUTPUT

             Df  Sum Sq Mean Sq F value Pr(>F)
origin        5 7262707 1452541   441.7 <2e-16 ***
Residuals   114  374890    3289
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R

# Step 4: Post-hoc --- which groups differ?
tukey_results <- TukeyHSD(spending_anova)
tidy(tukey_results) |>
  filter(adj.p.value < 0.05) |>
  arrange(adj.p.value)

OUTPUT

# A tibble: 15 × 7
   term   contrast            null.value estimate conf.low conf.high adj.p.value
   <chr>  <chr>                    <dbl>    <dbl>    <dbl>     <dbl>       <dbl>
 1 origin Colombia-Canada              0   -485     -538.     -432.     2.43e-14
 2 origin Other-Canada                 0   -289     -342.     -236.     2.43e-14
 3 origin Venezuela-Canada             0   -599     -652.     -546.     2.43e-14
 4 origin Netherlands-Colomb…          0    324.     272.      377.     2.43e-14
 5 origin United States-Colo…          0    573.     521.      626.     2.43e-14
 6 origin United States-Neth…          0    249.     196.      302.     2.43e-14
 7 origin Venezuela-Netherla…          0   -438.    -491.     -386.     2.43e-14
 8 origin United States-Other          0    377.     325.      430.     2.43e-14
 9 origin Venezuela-Other              0   -310     -363.     -257.     2.43e-14
10 origin Venezuela-United S…          0   -687.    -740.     -635.     2.43e-14
11 origin Other-Colombia               0    196      143.      249.     6.49e-14
12 origin Netherlands-Canada           0   -161.    -213.     -108.     2.75e-13
13 origin Other-Netherlands            0   -128.    -181.      -75.9    1.83e- 9
14 origin Venezuela-Colombia           0   -114.    -167.      -61.4    9.17e- 8
15 origin United States-Cana…          0     88.5     35.9     141.     5.04e- 5

R

# Step 5: Regression --- what predicts spending?
spending_model <- lm(avg_spending_usd ~ avg_stay_nights + origin + year,
                     data = visitors)
tidy(spending_model) |>
  mutate(across(where(is.numeric), ~ round(., 3)))

OUTPUT

# A tibble: 8 × 5
  term                estimate std.error statistic p.value
  <chr>                  <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)         -7888.     2707.       -2.91   0.004
2 avg_stay_nights       131.        4.92     26.6    0
3 originColombia       -194.       12.6     -15.4    0
4 originNetherlands    -605.       17.9     -33.9    0
5 originOther           -94.8       9.65     -9.82   0
6 originUnited States   113.        6.38     17.8    0
7 originVenezuela      -284.       13.4     -21.1    0
8 year                    4.03      1.34      3      0.003

R

# Step 6: Model fit
glance(spending_model) |>
  select(r.squared, adj.r.squared, p.value, AIC)

OUTPUT

# A tibble: 1 × 4
  r.squared adj.r.squared   p.value   AIC
      <dbl>         <dbl>     <dbl> <dbl>
1     0.994         0.994 6.69e-122 1069.

This entire analysis — from descriptives to regression — is captured in a script that you can re-run at any time. If your data updates or a reviewer requests changes, you modify the code and run it again. No clicking through dialog boxes.
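That re-run is a single call. A minimal sketch — a tiny throwaway script is created here so the example is self-contained; in practice you would `source()` your saved analysis file (the name `scripts/tourism_analysis.R` below is illustrative):

```r
# Create a small stand-in script; in real use this would be your saved
# analysis file, e.g. "scripts/tourism_analysis.R" (hypothetical name)
script <- tempfile(fileext = ".R")
writeLines("result <- round(mean(c(7.7, 7.3, 7.5, 7.8, 7.8)), 1)", script)

# source() runs every line of the script, top to bottom
source(script)
result  # the object the script created is now in your workspace
```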

Challenge

Challenge 1: Complete analysis workflow

Conduct a full analysis to answer this question: Do satisfaction scores differ between high-season (Q1 and Q4) and low-season (Q2 and Q3) quarters?

Your workflow should include:

  1. Create a new variable season that classifies Q1/Q4 as “High” and Q2/Q3 as “Low”
  2. Compute descriptive statistics (mean and SD of satisfaction_score) for each season
  3. Create a boxplot comparing satisfaction scores by season
  4. Run an independent-samples t-test
  5. Use broom::tidy() to extract the results into a clean table

R

# Step 1: Create the season variable
visitors_season <- visitors |>
  mutate(season = if_else(quarter %in% c("Q1", "Q4"), "High", "Low"))

# Step 2: Descriptive statistics
visitors_season |>
  group_by(season) |>
  summarise(
    n = n(),
    mean_satisfaction = mean(satisfaction_score),
    sd_satisfaction = sd(satisfaction_score)
  )

OUTPUT

# A tibble: 2 × 4
  season     n mean_satisfaction sd_satisfaction
  <chr>  <int>             <dbl>           <dbl>
1 High      60              7.71           0.382
2 Low       60              7.55           0.405

R

# Step 3: Boxplot
ggplot(visitors_season, aes(x = season, y = satisfaction_score,
                             fill = season)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Satisfaction Scores: High vs. Low Season",
    x = "Season",
    y = "Satisfaction Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

R

# Step 4: T-test
season_t <- t.test(satisfaction_score ~ season, data = visitors_season)
season_t

OUTPUT


	Welch Two Sample t-test

data:  satisfaction_score by season
t = 2.3194, df = 117.6, p-value = 0.0221
alternative hypothesis: true difference in means between group High and group Low is not equal to 0
95 percent confidence interval:
 0.02436478 0.30896855
sample estimates:
mean in group High  mean in group Low
          7.713333           7.546667 

R

# Step 5: Tidy results
tidy(season_t)

OUTPUT

# A tibble: 1 × 10
  estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
     <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
1    0.167      7.71      7.55      2.32  0.0221      118.   0.0244     0.309
# ℹ 2 more variables: method <chr>, alternative <chr>
Challenge

Challenge 2: Correlation and regression

Investigate the relationship between avg_stay_nights and avg_spending_usd:

  1. Run cor.test() to get the Pearson correlation and p-value
  2. Create a scatterplot with a linear trend line (geom_smooth(method = "lm"))
  3. Fit a linear regression predicting avg_spending_usd from avg_stay_nights
  4. Use tidy() and glance() to extract the results
  5. Interpret: for each additional night of stay, how much does spending increase?

R

# Step 1: Correlation
cor.test(visitors$avg_stay_nights, visitors$avg_spending_usd)

OUTPUT


	Pearson's product-moment correlation

data:  visitors$avg_stay_nights and visitors$avg_spending_usd
t = 8.0872, df = 118, p-value = 6.065e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4680186 0.7013374
sample estimates:
      cor
0.5971648 

R

# Step 2: Scatterplot
ggplot(visitors, aes(x = avg_stay_nights, y = avg_spending_usd)) +
  geom_point(aes(color = origin), alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", color = "black", linewidth = 0.8) +
  labs(
    title = "Relationship Between Length of Stay and Spending",
    x = "Average Stay (Nights)",
    y = "Average Spending (USD)",
    color = "Origin"
  ) +
  theme_minimal()

OUTPUT

`geom_smooth()` using formula = 'y ~ x'

R

# Step 3: Regression
stay_model <- lm(avg_spending_usd ~ avg_stay_nights, data = visitors)

# Step 4: Extract results
tidy(stay_model)

OUTPUT

# A tibble: 2 × 5
  term            estimate std.error statistic  p.value
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        430.      60.8       7.07 1.16e-10
2 avg_stay_nights     75.3      9.31      8.09 6.06e-13

R

glance(stay_model)

OUTPUT

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.357         0.351  204.      65.4 6.06e-13     1  -807. 1621. 1629.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

R

# Step 5: Interpretation
# The coefficient for avg_stay_nights tells us:
# for each additional night of stay, average spending increases by
# approximately the "estimate" value in USD, holding all else constant.
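If you want that sentence to quote the exact slope, you can pull it out of the fitted model instead of retyping it. A minimal sketch, with toy numbers standing in for the visitors data so it runs on its own:

```r
# Toy stand-in for the stay/spending relationship (illustrative numbers)
toy <- data.frame(
  avg_stay_nights  = c(5, 6, 7, 8),
  avg_spending_usd = c(800, 880, 950, 1030)
)

m <- lm(avg_spending_usd ~ avg_stay_nights, data = toy)

# The named coefficient is the estimated USD increase per extra night
slope <- unname(coef(m)["avg_stay_nights"])
round(slope, 1)  # 76
```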
Key Points
  • Every SPSS statistical test has a direct R equivalent, usually in a single function call
  • R output is more compact than SPSS — broom::tidy() converts it to a clean table
  • The workflow in R is: load data, run test, extract results, visualize — all in a script

Content from Reproducible Reporting


Last updated on 2026-04-22 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How do I combine my analysis and write-up in one document?
  • What is R Markdown and why is it better than copy-pasting from SPSS output?
  • How do I create a report that updates when data changes?

Objectives

  • Create an R Markdown document that combines text, code, and output
  • Generate tables and figures that update automatically
  • Export reports to Word, PDF, or HTML
  • Understand why script-based reporting is more reliable than SPSS output export
Cartoon of a researcher pressing a big red button on a tropical machine that converts raw data into a finished report
Your supervisor changed the sample. Again. Good thing you only need one button.

The problem with copy-paste


If you have used SPSS for reporting, this workflow will feel very familiar:

  1. Run your analysis in SPSS
  2. Get a table or chart in the Output window
  3. Copy it
  4. Paste it into your Word document
  5. Write your interpretation around it
  6. Your supervisor / colleague sends updated data
  7. Go back to step 1 and redo everything

This copy-paste workflow is fragile. Every time the data changes, you need to re-run every analysis, re-copy every table, and re-paste into your document. Along the way, it is easy to accidentally paste an old table, forget to update a number in the text, or lose track of which version of the analysis matches your report.

R Markdown solves this problem. It lets you write your text and your analysis in a single file. When the data changes, you press one button and the entire report — text, tables, figures, and all the numbers in your sentences — updates automatically.

Callout

This is not just convenience — it is research integrity

When your numbers and your text live in the same document, it is physically impossible for them to get out of sync. This matters for policy reports, academic papers, and any situation where someone else relies on your numbers.

R Markdown basics


An R Markdown file is a plain text file with the extension .Rmd. It has three types of content:

  1. A YAML header at the top (metadata about the document)
  2. Markdown text (your writing)
  3. Code chunks (your R analysis)

Let us look at each one.

The YAML header

Every R Markdown document starts with a block between --- lines. This is the YAML header, and it controls the document settings:

---
title: "Aruba Tourism Analysis Q1 2023"
author: "Your Name"
date: "2026-04-22"
output: word_document
---

The output line controls the format of your final document:

Output format What you get
word_document A .docx Word file
html_document A web page
pdf_document A PDF (requires LaTeX)

For most government and policy work, word_document is the most practical choice — your colleagues can open it, comment on it, and print it without installing anything.

Callout

Start with Word, explore later

We recommend word_document for this course because it fits the workflow most SPSS users already have. Once you are comfortable, try html_document — it supports interactive tables and plots.

Markdown text formatting

Between your code chunks, you write normal text using Markdown — a simple way to format text with plain characters. Here are the essentials:

# First-level heading
## Second-level heading
### Third-level heading

**bold text**
*italic text*

- Bullet point one
- Bullet point two

1. Numbered item one
2. Numbered item two

[Link text](https://example.com)

That is all you need for most reports. If you have used WhatsApp or Slack formatting, this will feel familiar.

Code chunks

A code chunk is where your R code lives. It starts with ```{r} and ends with ```:

```{r}
library(tidyverse)
visitors <- read_csv("data/aruba_visitors.csv")
summary(visitors$visitors_stayover)
```

When you knit the document, R runs the code and places the output directly into your report. No copying, no pasting.

Chunk options

You can control what appears in the final document by adding options to the chunk header:

```{r, echo = FALSE, message = FALSE, warning = FALSE}
library(tidyverse)
visitors <- read_csv("data/aruba_visitors.csv")
```
Option What it does
echo = FALSE Hides the code, shows only the output
message = FALSE Suppresses package loading messages
warning = FALSE Suppresses warnings
eval = FALSE Shows the code but does not run it
fig.width = 8 Sets figure width in inches
fig.height = 5 Sets figure height in inches

For a polished report aimed at a non-technical audience, you will typically set echo = FALSE so readers see results but not code.

Inline R code

This is the feature that makes R Markdown truly powerful. You can embed R calculations directly inside your sentences:

The dataset contains `r nrow(visitors)` observations.

When knitted, this becomes:

The dataset contains 120 observations.

If the data changes and you re-knit, that number updates automatically. No more manually searching through your Word document to find every number that needs updating.

Knitting: from .Rmd to a finished document

To turn your .Rmd file into a Word document (or HTML, or PDF), you knit it. In RStudio:

  1. Click the Knit button at the top of the editor (the ball of yarn icon)
  2. R runs all your code chunks from top to bottom in a clean environment
  3. The finished document appears

The most common reason knitting fails is that the .Rmd file itself does not load the packages or data that its chunks depend on. Knitting starts from a blank environment, so every package and dataset must be loaded within the .Rmd file — even if it is already loaded in your current R session.

Another common issue is file paths. If you use read_csv("data/aruba_visitors.csv"), the working directory during knitting is the folder where the .Rmd file is saved. Make sure the data file is in the right location relative to the .Rmd file.
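One defensive habit is to check the path before reading anything. A small sketch of a helper you could call from the setup chunk (the helper name `check_data` is illustrative):

```r
# Fail early with a clear message if the data file is not where the
# .Rmd expects it (paths during knitting are relative to the .Rmd's folder)
check_data <- function(path) {
  if (!file.exists(path)) {
    stop("Cannot find ", path, " relative to the .Rmd file")
  }
  invisible(TRUE)
}

# In the setup chunk you would call: check_data("data/aruba_visitors.csv")
```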

Callout

Knitting runs everything fresh

A common mistake is to rely on objects you created in your R console but never included in the .Rmd file. When you knit, R starts with a completely empty workspace. If you get an error like “object not found,” it usually means you forgot to include the code that creates that object in your .Rmd file.

Building a mini-report


Let us build a short tourism analysis report step by step. In RStudio:

  1. Go to File > New File > R Markdown…
  2. Enter a title like “Aruba Tourism Report”
  3. Enter your name as author
  4. Select Word as the default output format
  5. Click OK

RStudio gives you a template document. Delete everything below the YAML header and replace it with the following sections.

Step 1: Setup chunk

The first code chunk in any report should load your packages and data. We hide the code and messages because the reader does not need to see them.

R

# This would be at the top of your .Rmd file (after the YAML header):

# ```{r setup, message = FALSE, warning = FALSE, echo = FALSE}
library(tidyverse)
visitors <- read_csv("data/aruba_visitors.csv")
# ```

Step 2: Write an introduction in Markdown

Below the setup chunk, write some context in plain Markdown:

## Introduction

This report summarizes visitor arrivals to Aruba for the period
2019--2023. Data are drawn from quarterly records of stay-over
and cruise visitors by country of origin.

Step 3: A summary table

Now add a code chunk that produces a summary table. The knitr::kable() function turns a data frame into a nicely formatted table in your output document:

R

annual_summary <- visitors |>
  group_by(year) |>
  summarise(
    total_stayover = sum(visitors_stayover),
    total_cruise = sum(visitors_cruise),
    avg_spending = round(mean(avg_spending_usd), 0),
    avg_satisfaction = round(mean(satisfaction_score), 1)
  )

knitr::kable(annual_summary, col.names = c(
  "Year", "Stay-over Visitors", "Cruise Visitors",
  "Avg Spending (USD)", "Avg Satisfaction"
))
Year Stay-over Visitors Cruise Visitors Avg Spending (USD) Avg Satisfaction
2019 440650 195950 919 7.7
2020 208480 61200 821 7.3
2021 357200 121080 888 7.5
2022 443700 203720 924 7.8
2023 463700 213020 938 7.8

Step 4: A visualization

Add another chunk with a ggplot2 chart:

R

annual_stayover <- visitors |>
  group_by(year) |>
  summarise(total_stayover = sum(visitors_stayover))

ggplot(annual_stayover, aes(x = year, y = total_stayover)) +
  geom_col(fill = "#2c7bb6") +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Total Stay-over Visitors to Aruba by Year",
    x = "Year",
    y = "Stay-over visitors"
  ) +
  theme_minimal()

Step 5: Interpretation with inline R

Now write a paragraph that uses inline R to insert numbers directly:

In your .Rmd file, you would write something like this:

The strongest year for stay-over tourism was `r best_year$year`,
with `r scales::comma(best_year$total_stayover)` visitors.
The weakest year was `r worst_year$year`, with
`r scales::comma(worst_year$total_stayover)` visitors.
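Those inline references assume `best_year` and `worst_year` were created in an earlier chunk. A minimal sketch using dplyr's `slice_max()` and `slice_min()`, with the annual totals from the Step 3 table standing in for the real summary:

```r
library(dplyr)

# Stand-in for the annual_stayover summary built in Step 4
annual_stayover <- tibble(
  year = 2019:2023,
  total_stayover = c(440650, 208480, 357200, 443700, 463700)
)

# One-row data frames the inline sentences can reference
best_year  <- slice_max(annual_stayover, total_stayover, n = 1)
worst_year <- slice_min(annual_stayover, total_stayover, n = 1)

best_year$year    # 2023
worst_year$year   # 2020
```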

When knitted, this becomes a complete sentence with real numbers — numbers that update automatically if the data ever changes.

Step 6: Knit

Click the Knit button. RStudio generates a Word document with your introduction, table, chart, and interpretation — all in one step, all from one file.

Callout

Compare this to the SPSS workflow

Think about what you just did: you loaded data, computed a summary, created a chart, and wrote an interpretation with automatically calculated numbers — all in a single file. If the data file is updated next quarter, you re-knit and the entire report updates. No manual copy-paste. No version confusion.

A complete example


Here is what a minimal but complete .Rmd file looks like, all in one place:

---
title: "Aruba Tourism Quarterly Report"
author: "Your Name"
date: "`r Sys.Date()`"
output: word_document
---

```{r setup, message = FALSE, warning = FALSE, echo = FALSE}
library(tidyverse)
visitors <- read_csv("data/aruba_visitors.csv")
```

## Overview

This report summarizes Aruba visitor statistics for 2019--2023.

```{r summary-table, echo = FALSE}
annual <- visitors |>
  group_by(year) |>
  summarise(
    stayover = sum(visitors_stayover),
    cruise = sum(visitors_cruise)
  )

knitr::kable(annual, col.names = c("Year", "Stay-over", "Cruise"))
```

```{r trend-chart, echo = FALSE, fig.width = 7, fig.height = 4}
ggplot(annual, aes(x = year, y = stayover)) +
  geom_line(linewidth = 1, colour = "#2c7bb6") +
  geom_point(size = 3, colour = "#2c7bb6") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Stay-over Visitors by Year", x = "Year", y = "Visitors") +
  theme_minimal()
```

## Key Findings

```{r findings, echo = FALSE}
latest <- annual |> filter(year == max(year))
previous <- annual |> filter(year == max(year) - 1)
change_pct <- round((latest$stayover - previous$stayover) /
                       previous$stayover * 100, 1)
```

In `r latest$year`, Aruba received
`r scales::comma(latest$stayover)` stay-over visitors, a
`r change_pct`% change compared to the previous year.
Challenge

Challenge 1: Create a tourism report

Create a new R Markdown document (File > New File > R Markdown) and build a short report that does the following:

  1. Loads data/aruba_visitors.csv
  2. Produces a summary table showing average spending and satisfaction by country of origin (using knitr::kable())
  3. Creates a bar chart of average spending by origin
  4. Includes at least one inline R value in a written interpretation sentence (for example, which origin market has the highest average spending)
  5. Knits to Word

Here is one way to approach it. Your file would look like this:

---
title: "Aruba Visitor Spending by Origin"
author: "Your Name"
date: "`r Sys.Date()`"
output: word_document
---

```{r setup, message = FALSE, warning = FALSE, echo = FALSE}
library(tidyverse)
visitors <- read_csv("data/aruba_visitors.csv")
```

## Visitor Spending by Country of Origin

```{r origin-table, echo = FALSE}
origin_summary <- visitors |>
  group_by(origin) |>
  summarise(
    avg_spending = round(mean(avg_spending_usd), 0),
    avg_satisfaction = round(mean(satisfaction_score), 1)
  ) |>
  arrange(desc(avg_spending))

knitr::kable(origin_summary, col.names = c(
  "Origin", "Avg Spending (USD)", "Avg Satisfaction"
))
```

```{r origin-chart, echo = FALSE, fig.width = 7, fig.height = 4}
ggplot(origin_summary, aes(x = reorder(origin, avg_spending),
                            y = avg_spending)) +
  geom_col(fill = "#2c7bb6") +
  coord_flip() +
  labs(
    title = "Average Visitor Spending by Country of Origin",
    x = "Country of Origin",
    y = "Average Spending (USD)"
  ) +
  theme_minimal()
```

## Interpretation

```{r top-spender, echo = FALSE}
top <- origin_summary |> slice(1)
```

The highest-spending visitor segment is from `r top$origin`,
with an average spend of $`r top$avg_spending` USD per trip.
This group also has an average satisfaction score of
`r top$avg_satisfaction` out of 10.

Capstone: a full report from one command


The capstone of this course is a single R Markdown file that produces a polished, multi-page PDF report at the press of the Knit button. The worked example is the Xander Bogaerts baseball statistics report you saw in Episode 1’s opening — the same file, opened now with the skills from this episode in hand.

The full template lives at episodes/files/xander-bogaerts-report-template.Rmd. The instructor will walk through its structure: a YAML header, a hidden setup chunk, a small number of analysis chunks, and prose with inline R for the narrative numbers. Every technique the template uses appears earlier in this episode — the capstone is assembly, not new material.

Key Points
  • R Markdown combines your analysis and write-up in a single document
  • When data changes, re-knitting updates every table and figure automatically
  • You can output to Word, PDF, or HTML from the same source file
  • This eliminates the copy-paste errors that are common with SPSS output

Content from Where to Go from Here


Last updated on 2026-04-22 | Edit this page

Estimated time: 45 minutes

Overview

Questions

  • Where do I find help when I get stuck?
  • What resources are available for continued learning?
  • How do I access Dutch Caribbean data directly from R?

Objectives

  • Know where to find help: documentation, community forums, Stack Overflow
  • Access Dutch Caribbean data using R packages (cbsodataR, WDI)
  • Build a personal library of R scripts that replace SPSS workflows
  • Understand the Carpentries community and further learning paths
Cartoon of a researcher on a sailboat studying a treasure map of R learning destinations like Packages Cove and Tidyverse Peak
You learned the basics. The map has a lot more islands.

Getting help


Everyone gets stuck. The difference between a beginner and an experienced R user is not that experienced users never see errors — it is that they know where to look when they do. Here are the most useful help resources, in order of how quickly they give you an answer.

Built-in documentation

Every R function has a help page. Access it in two ways:

R

# These two are equivalent:
?mean
help("mean")

The help page shows what arguments the function takes, what it returns, and usually includes examples at the bottom. The examples are often the most useful part — scroll down to them first.

Callout

Tip: run the examples

At the bottom of most help pages there is an “Examples” section. You can run all of them at once with:

R

example(mean)

This is a fast way to see what a function does without reading the full documentation.

Posit cheat sheets

Posit (the company behind RStudio) publishes excellent two-page cheat sheets for the most popular packages. Print these out or keep them open on a second screen:

  • Data import: readr, readxl
  • Data transformation: dplyr
  • Visualization: ggplot2
  • R Markdown

Find them all at: https://posit.co/resources/cheatsheets/

R Graph Gallery (https://r-graph-gallery.com) is a searchable catalogue of ggplot2 examples with the full code for each one. When you know roughly what kind of chart you want but not the syntax to build it, this is where to start.

Stack Overflow

Stack Overflow has a massive collection of R questions and answers. In Stack Overflow's own search box, prefix your query with [r] to restrict it to questions tagged with R:

[r] how to rename columns dplyr

From a general search engine, add site:stackoverflow.com to your query instead:

site:stackoverflow.com how to rename columns dplyr

Before posting your own question, search first — most beginner questions have already been answered. When you do post, include a minimal reproducible example (a small piece of code that someone else can run to see your problem).
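A minimal reproducible example pairs a few lines of toy data with the exact call that surprises you. A sketch of the shape such an example usually takes:

```r
# Toy data small enough to paste into a question
df <- data.frame(x = c(1, 2, NA))

# The exact call and the unexpected result
mean(df$x)                # returns NA, not 1.5
mean(df$x, na.rm = TRUE)  # returns 1.5
```

Anyone reading your question can run these lines and see precisely what you see, which makes a helpful answer far more likely.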

Posit Community

The Posit Community forum (https://community.rstudio.com) is a friendlier, more focused alternative to Stack Overflow. It is specifically for R and RStudio questions, and the community is welcoming to beginners.

R-bloggers

R-bloggers aggregates hundreds of R blogs. It is a great place to discover tutorials, new packages, and practical examples. You can subscribe to the email newsletter to get a daily digest.

Dutch Caribbean data in R


One of the practical advantages of R over SPSS is that you can pull data directly from online sources into your R session — no manual downloads, no saving .sav files. Here are three sources especially relevant to the Dutch Caribbean.

CBS Netherlands with cbsodataR

The cbsodataR package connects directly to CBS (Statistics Netherlands) StatLine, which includes data for the BES islands (Bonaire, Sint Eustatius, Saba) and sometimes Aruba, Curacao, and Sint Maarten.

R

# Install the package (only needed once):
install.packages("cbsodataR")

R

library(cbsodataR)

# Browse available tables (there are thousands):
tables <- cbs_get_datasets()

# Search for Caribbean Netherlands tables:
caribbean_tables <- tables |>
  dplyr::filter(grepl("Caribisch|Caribbean|Bonaire", Title, ignore.case = TRUE))

# View what we found:
head(caribbean_tables[, c("Identifier", "Title")], 10)

Once you find a table of interest, you can download it directly:

R

# Example: population data for Caribbean Netherlands
# (Check cbs_get_datasets() for current table identifiers)
pop_data <- cbs_get_data("83698ENG")

head(pop_data)
Callout

Finding the right table

CBS has thousands of tables, and the identifiers (like “83698ENG”) change over time. The safest approach is to:

  1. Search on opendata.cbs.nl in your browser
  2. Find the table you want
  3. Copy the identifier from the URL
  4. Use that identifier in cbs_get_data()

World Bank data with WDI

The WDI package pulls data from the World Bank’s World Development Indicators. This is excellent for comparing Aruba, Curacao, and Sint Maarten to other small island states.

R

# Install the package (only needed once):
install.packages("WDI")

Here is a working example that pulls GDP per capita data for Aruba:

R

library(WDI)

# Pull GDP per capita (current US$) for Aruba
# Country code for Aruba is "ABW"
aruba_gdp <- WDI(
  country = "ABW",
  indicator = "NY.GDP.PCAP.CD",
  start = 2000,
  end = 2023
)

# Clean up column names
names(aruba_gdp)[names(aruba_gdp) == "NY.GDP.PCAP.CD"] <- "gdp_per_capita"

# Show the most recent years
tail(aruba_gdp[, c("year", "gdp_per_capita")], 10)

OUTPUT

   year gdp_per_capita
15 2009       25134.77
16 2008       28171.91
17 2007       26736.31
18 2006       24845.66
19 2005       24171.84
20 2004       23700.63
21 2003       21949.49
22 2002       21307.25
23 2001       20740.13
24 2000       20681.02

You can also compare multiple countries at once:

R

library(WDI)
library(ggplot2)

# Compare Aruba (ABW), Curacao (CUW), and Sint Maarten (SXM)
island_gdp <- WDI(
  country = c("ABW", "CUW", "SXM"),
  indicator = "NY.GDP.PCAP.CD",
  start = 2000,
  end = 2023
)

names(island_gdp)[names(island_gdp) == "NY.GDP.PCAP.CD"] <- "gdp_per_capita"

ggplot(island_gdp, aes(x = year, y = gdp_per_capita, colour = country)) +
  geom_line(linewidth = 1) +
  labs(
    title = "GDP per Capita: Dutch Caribbean Comparison",
    x = "Year",
    y = "GDP per Capita (Current USD)",
    colour = "Country"
  ) +
  theme_minimal()
Callout

How to find World Bank indicator codes

The easiest way to find indicator codes is to use WDIsearch():

R

# Search for indicators related to tourism:
WDIsearch("tourism")

# Search for indicators related to GDP:
WDIsearch("GDP per capita")

You can also browse indicators at https://data.worldbank.org/indicator.

Direct downloads from CBS Aruba

CBS Aruba (https://cbs.aw) publishes data in Excel and PDF format. While there is no dedicated R package, you can download and read Excel files directly:

R

library(readxl)

# Download an Excel file from CBS Aruba to a temporary location:
url <- "https://cbs.aw/wp/index.php/download-file/?example.xlsx"
temp_file <- tempfile(fileext = ".xlsx")
download.file(url, temp_file, mode = "wb")

# Read it into R:
aruba_data <- read_excel(temp_file)

The same approach works for CBS Curacao (https://www.cbs.cw).

Open datasets from the University of Aruba

The University of Aruba maintains a small set of public reference datasets relevant to research on the Dutch Caribbean and small island states. Both are CSV files hosted on GitHub, so you can pull them into R directly without any package install.

Dutch Caribbean election results at github.com/University-of-Aruba/CAS_election_data. A tidy-format dataset of votes by party for elections across Aruba (1985-2024), Curacao (2010-2025), and Sint Maarten (2010-2024). One row per party per election, with columns for year, country, party, and votes.

R

# Read the CSV directly from GitHub
elections <- read.csv(
  "https://raw.githubusercontent.com/University-of-Aruba/CAS_election_data/main/dutch_caribbean_elections.csv"
)

# Combined vote of the two main Aruban parties versus all others over time
library(dplyr)
elections |>
  filter(country == "Aruba") |>
  mutate(group = ifelse(party %in% c("MEP", "AVP"), "MEP & AVP", "Other")) |>
  group_by(year, group) |>
  summarise(votes = sum(votes), .groups = "drop") |>
  arrange(year, group)

Small island reference list at github.com/University-of-Aruba/island-research-reference-data. A country and territory list with classifications for SIDS, SNIJ, and World Bank region and income group. Designed for XLSForm survey tools (KoboToolbox, ODK), but equally useful as a lookup table to filter or join against any other dataset by country code.

R

# Filter to only Small Island Developing States
library(dplyr)

countries <- read.csv(
  "https://raw.githubusercontent.com/University-of-Aruba/island-research-reference-data/main/countries/countries_reference_xlsform.csv"
)

sids <- countries |>
  filter(is_sids == 1) |>
  select(iso_code, label, wb_region, wb_income_group)

head(sids)

Building a personal script library


Over time, you will write scripts that solve specific problems: cleaning a particular dataset, running a standard analysis, producing a recurring report. Do not let these scripts disappear. Build a personal library.

Practical tips

  1. Save every analysis as a .R or .Rmd file — never rely on your console history alone.

  2. Use clear file names that describe what the script does:

    01_clean_visitor_data.R
    02_quarterly_summary_report.Rmd
    03_spending_by_origin_analysis.R
  3. Comment generously. You will forget why you wrote something in three months. Write comments that explain the why, not just the what:

R

# Remove 2020 Q2 because lockdown made the data non-comparable
visitors_clean <- visitors |>
  filter(!(year == 2020 & quarter == "Q2"))
  4. Create a project folder structure:

    my-project/
    ├── data/          # Raw data files (never edit these)
    ├── scripts/       # R scripts for data cleaning and analysis
    ├── output/        # Generated reports and figures
    └── README.txt     # What this project is about
  5. Use RStudio Projects (File > New Project) to keep everything together. A Project sets your working directory automatically and keeps your files organized.

Callout

Your scripts are your institutional memory

In many small-island government offices, knowledge walks out the door when staff transfer or retire. If your analyses live in commented scripts, the next person can pick up exactly where you left off — even if they only know basic R.

Continued learning


You have covered a lot of ground in this course. Here is where to go next to keep building your skills.

Free books and courses

  • R for Data Science (2nd edition) by Hadley Wickham, Mine Cetinkaya-Rundel, and Garrett Grolemund. Free online at https://r4ds.hadley.nz. This is the single best next step — it covers everything in this course in more depth, plus much more.

  • The Carpentries offers free, community-taught workshops on R, Python, Git, and more. Lessons are at https://carpentries.org/community-lessons/.

  • Posit Cloud (https://posit.cloud) gives you RStudio in your browser with free interactive primers. Good for practicing without installing anything.

Community

  • DCDC Network — the Dutch Caribbean Data Community is your regional peer network. If you are taking this course, you are already part of it. Use the network to share scripts, ask questions, and collaborate on data projects across the islands.

  • Posit Community (https://community.rstudio.com) — ask questions, share your work, help others. The best way to learn is to teach.

  • TidyTuesday — a weekly community data visualization challenge. Every Tuesday, a new dataset is posted and people share their analyses. Great practice and inspiration: https://github.com/rfordatascience/tidytuesday

Between-session practice assignment


The between-session practice assignment is a separate page so it can be handed to you at the end of Day 1 and opened from any device overnight. See the Homework brief under the “For Learners” menu of this site.

Challenge

Challenge 1: Pull World Bank data and visualize it

Use the WDI package to pull a World Bank indicator for Aruba and create a basic plot. Follow these steps:

  1. Load the WDI and ggplot2 packages
  2. Use WDIsearch() to find an indicator that interests you (for example, international tourism arrivals, unemployment, or population)
  3. Use WDI() to download the data for Aruba (country code "ABW")
  4. Create a line plot showing how the indicator changes over time
  5. Add a meaningful title and axis labels

Here is an example using international tourism arrivals:

R

library(WDI)
library(ggplot2)

# Search for tourism-related indicators:
WDIsearch("tourism")

# Pull international tourism arrivals for Aruba
# Indicator: ST.INT.ARVL = International tourism, number of arrivals
aruba_tourism <- WDI(
  country = "ABW",
  indicator = "ST.INT.ARVL",
  start = 2000,
  end = 2023
)

# Rename the indicator column for clarity
names(aruba_tourism)[names(aruba_tourism) == "ST.INT.ARVL"] <- "arrivals"

# Remove rows where arrivals is missing
aruba_tourism <- aruba_tourism[!is.na(aruba_tourism$arrivals), ]

# Create the plot
ggplot(aruba_tourism, aes(x = year, y = arrivals)) +
  geom_line(linewidth = 1, colour = "#2c7bb6") +
  geom_point(size = 2, colour = "#2c7bb6") +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "International Tourism Arrivals to Aruba",
    subtitle = "Source: World Bank World Development Indicators",
    x = "Year",
    y = "Number of Arrivals"
  ) +
  theme_minimal()

Your chart will look different depending on which indicator you chose — that is fine. The key skills are the same regardless: searching for an indicator, downloading the data, and visualizing it.

Callout

You are ready

You now know how to import data, transform it, visualize it, test hypotheses, and produce automated reports — all in R. That is not everything, but it is a solid foundation. The most important thing now is to use it. The next time you need to analyze data, try doing it in R instead of SPSS. You will be slower at first, but each time it gets easier. And everything you do is reproducible, shareable, and transparent.

Before you leave


Please take a few minutes to complete these short surveys. Your feedback helps us improve the course and strengthens the DCDC Network.

Course evaluation — tell us what worked, what didn’t, and what you’d change. This directly shapes future sessions.

Complete the course evaluation survey

DCDC Network onboarding — help us understand your data practices and connect you to the broader network. This feeds into DCDC research on digital competence across the Dutch Caribbean.

Complete the DCDC Network onboarding survey

Key Points
  • R has a large, active community — you are never stuck alone
  • cbsodataR and WDI let you pull Dutch Caribbean data directly into R
  • Save your analyses as scripts and build a personal reference library
  • The DCDC Network is your regional peer community for continued learning