Data Manipulation
Last updated on 2026-04-22 | Edit this page
Estimated time: 90 minutes
Overview
Questions
- How do I filter, sort, and recode data in R the way I do in SPSS?
- What is the tidyverse and why does it matter?
- How do I create new variables from existing ones?
Objectives
- Filter rows, select columns, and sort data using dplyr
- Create new variables and recode existing ones
- Chain operations together using the pipe operator
- Recognize the SPSS menu equivalent for each operation

The tidyverse approach
In SPSS, you manipulate data through menus: Data > Select
Cases, Data > Sort Cases, Transform
> Compute Variable, and so on. In R, the dplyr
package gives you a set of verbs — functions with
plain-English names that do exactly what they say.
The dplyr package is part of the
tidyverse, which you already loaded in the previous
episode. If you are starting a fresh R session, load it now:
R
library(tidyverse)
OUTPUT
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
R
# Also load our dataset
visitors <- read_csv("data/aruba_visitors.csv")
OUTPUT
Rows: 120 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): quarter, origin
dbl (7): year, visitors_stayover, visitors_cruise, avg_stay_nights, avg_spen...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
What is a dependency?
A package in R is a bundle of code that someone else wrote so you do
not have to. tidyverse, for example, is actually a
collection of packages that work together for data manipulation and
visualisation. When you use library(tidyverse), R loads
those packages into your session so their functions become
available.
A dependency is a package that another package needs in order to
work. tidyverse depends on dplyr,
ggplot2, readr, and several others. When you
run install.packages("tidyverse"), R automatically installs
everything it depends on too. You do not have to manage the chain
manually.
Two things this means in practice. First, the first time you install
a package it can take a minute or two because R is pulling down the
dependency chain. That is normal. Second, when you share your script
with a colleague and they get an error like
there is no package called 'dplyr', the fix is almost
always install.packages("tidyverse"), not
install.packages("dplyr"), because the dependency lives
inside the larger package.
We come back to this in Episode 6 (reproducible reporting), where recording which packages your script needs is part of making sure it still runs six months from now.
Here is the key idea: every SPSS menu operation you use for data manipulation has a dplyr verb equivalent.
| SPSS menu path | dplyr verb | What it does |
|---|---|---|
| Data > Select Cases | filter() |
Keep rows that match a condition |
| (selecting columns in Variable View) | select() |
Keep or drop columns |
| Transform > Compute Variable | mutate() |
Create or modify a column |
| Transform > Recode into Different Variables | case_when() |
Assign values based on conditions |
| Data > Sort Cases | arrange() |
Sort rows |
| Data > Split File + Aggregate |
group_by() + summarise()
|
Calculate summaries by group |
Let us work through each one.
filter() — Select Cases
In SPSS, you would go to Data > Select Cases,
click “If condition is satisfied”, and type a condition like
origin = "United States". In R:
R
us_visitors <- filter(visitors, origin == "United States")
us_visitors
OUTPUT
# A tibble: 20 × 9
year quarter origin visitors_stayover visitors_cruise avg_stay_nights
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2019 Q1 United States 72450 45200 6.8
2 2019 Q2 United States 65200 38100 6.5
3 2019 Q3 United States 58900 32400 6.2
4 2019 Q4 United States 69800 42800 6.7
5 2020 Q1 United States 68100 40200 6.6
6 2020 Q2 United States 4200 0 5.8
7 2020 Q3 United States 18500 0 6
8 2020 Q4 United States 42100 8200 6.3
9 2021 Q1 United States 48200 12400 6.4
10 2021 Q2 United States 52800 18600 6.5
11 2021 Q3 United States 55400 25200 6.3
12 2021 Q4 United States 64200 38500 6.6
13 2022 Q1 United States 74800 46800 6.9
14 2022 Q2 United States 67500 39200 6.6
15 2022 Q3 United States 61200 34600 6.3
16 2022 Q4 United States 72100 44500 6.8
17 2023 Q1 United States 78200 48900 7
18 2023 Q2 United States 70100 41200 6.7
19 2023 Q3 United States 63500 36200 6.4
20 2023 Q4 United States 75600 46100 6.9
# ℹ 3 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
# satisfaction_score <dbl>
Notice the double equals sign ==. This
is how R tests equality. A single = is for assigning values
to function arguments (like digits = 1); a double
== asks “is this equal to?”.
You can combine conditions:
R
# US visitors in 2023 only
us_2023 <- filter(visitors, origin == "United States", year == 2023)
us_2023
OUTPUT
# A tibble: 4 × 9
year quarter origin visitors_stayover visitors_cruise avg_stay_nights
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2023 Q1 United States 78200 48900 7
2 2023 Q2 United States 70100 41200 6.7
3 2023 Q3 United States 63500 36200 6.4
4 2023 Q4 United States 75600 46100 6.9
# ℹ 3 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
# satisfaction_score <dbl>
R
# Rows where stayover visitors exceeded 50,000
big_quarters <- filter(visitors, visitors_stayover > 50000)
big_quarters
OUTPUT
# A tibble: 16 × 9
year quarter origin visitors_stayover visitors_cruise avg_stay_nights
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2019 Q1 United States 72450 45200 6.8
2 2019 Q2 United States 65200 38100 6.5
3 2019 Q3 United States 58900 32400 6.2
4 2019 Q4 United States 69800 42800 6.7
5 2020 Q1 United States 68100 40200 6.6
6 2021 Q2 United States 52800 18600 6.5
7 2021 Q3 United States 55400 25200 6.3
8 2021 Q4 United States 64200 38500 6.6
9 2022 Q1 United States 74800 46800 6.9
10 2022 Q2 United States 67500 39200 6.6
11 2022 Q3 United States 61200 34600 6.3
12 2022 Q4 United States 72100 44500 6.8
13 2023 Q1 United States 78200 48900 7
14 2023 Q2 United States 70100 41200 6.7
15 2023 Q3 United States 63500 36200 6.4
16 2023 Q4 United States 75600 46100 6.9
# ℹ 3 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
# satisfaction_score <dbl>
Common comparison operators
| Operator | Meaning | Example |
|---|---|---|
== |
equal to | origin == "Canada" |
!= |
not equal to | origin != "Other" |
> |
greater than | visitors_stayover > 10000 |
< |
less than | avg_stay_nights < 5 |
>= |
greater than or equal to | year >= 2021 |
<= |
less than or equal to | satisfaction_score <= 7.5 |
%in% |
matches one of several values | origin %in% c("United States", "Canada") |
select() — Keep or drop columns
Sometimes your dataset has more columns than you need. In SPSS, you
might delete variables or simply ignore them. In R,
select() lets you keep only the columns you want:
R
# Keep only year, quarter, origin, and stayover visitors
slim <- select(visitors, year, quarter, origin, visitors_stayover)
head(slim)
OUTPUT
# A tibble: 6 × 4
year quarter origin visitors_stayover
<dbl> <chr> <chr> <dbl>
1 2019 Q1 United States 72450
2 2019 Q1 Netherlands 18300
3 2019 Q1 Venezuela 8200
4 2019 Q1 Colombia 6100
5 2019 Q1 Canada 4500
6 2019 Q1 Other 9800
You can also drop columns by putting a minus sign in front:
R
# Drop the satisfaction score column
no_satisfaction <- select(visitors, -satisfaction_score)
head(no_satisfaction)
OUTPUT
# A tibble: 6 × 8
year quarter origin visitors_stayover visitors_cruise avg_stay_nights
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2019 Q1 United States 72450 45200 6.8
2 2019 Q1 Netherlands 18300 1200 10.2
3 2019 Q1 Venezuela 8200 400 5.1
4 2019 Q1 Colombia 6100 800 4.8
5 2019 Q1 Canada 4500 3200 7.1
6 2019 Q1 Other 9800 5600 5.5
# ℹ 2 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>
mutate() — Compute Variable
In SPSS: Transform > Compute Variable. You would
type a target variable name, then an expression. In R,
mutate() creates a new column (or modifies an existing
one):
R
visitors <- mutate(visitors,
total_visitors = visitors_stayover + visitors_cruise
)
head(select(visitors, year, quarter, origin, visitors_stayover, visitors_cruise, total_visitors))
OUTPUT
# A tibble: 6 × 6
year quarter origin visitors_stayover visitors_cruise total_visitors
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2019 Q1 United States 72450 45200 117650
2 2019 Q1 Netherlands 18300 1200 19500
3 2019 Q1 Venezuela 8200 400 8600
4 2019 Q1 Colombia 6100 800 6900
5 2019 Q1 Canada 4500 3200 7700
6 2019 Q1 Other 9800 5600 15400
You can create multiple columns at once:
R
visitors <- mutate(visitors,
spending_per_night = avg_spending_usd / avg_stay_nights,
stayover_pct = visitors_stayover / total_visitors * 100
)
case_when() — Recode into Different Variables
In SPSS: Transform > Recode into Different
Variables, where you map old values to new values. In R, you
use case_when() inside mutate():
R
visitors <- mutate(visitors,
region = case_when(
origin == "United States" ~ "North America",
origin == "Canada" ~ "North America",
origin == "Netherlands" ~ "Europe",
origin == "Venezuela" ~ "South America",
origin == "Colombia" ~ "South America",
.default = "Other"
)
)
table(visitors$region)
OUTPUT
Europe North America Other South America
20 40 20 40
The syntax is: condition ~ value_to_assign. The
.default line catches everything that did not match a
previous condition — like the “Else” box in SPSS Recode.
arrange() — Sort Cases
In SPSS: Data > Sort Cases. In R:
R
# Sort by stayover visitors, ascending (default)
arrange(visitors, visitors_stayover)
OUTPUT
# A tibble: 120 × 13
year quarter origin visitors_stayover visitors_cruise avg_stay_nights
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2020 Q2 Canada 180 0 5.2
2 2020 Q2 Venezuela 200 0 3.5
3 2020 Q2 Colombia 300 0 3.8
4 2020 Q2 Other 400 0 4.1
5 2020 Q3 Venezuela 800 0 3.9
6 2020 Q2 Netherlands 1100 0 8.5
7 2020 Q3 Canada 1100 0 5.8
8 2020 Q3 Colombia 1200 0 4
9 2020 Q4 Venezuela 1800 0 4.1
10 2021 Q1 Venezuela 2100 50 4
# ℹ 110 more rows
# ℹ 7 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
# satisfaction_score <dbl>, total_visitors <dbl>, spending_per_night <dbl>,
# stayover_pct <dbl>, region <chr>
R
# Sort descending — use desc()
arrange(visitors, desc(visitors_stayover))
OUTPUT
# A tibble: 120 × 13
year quarter origin visitors_stayover visitors_cruise avg_stay_nights
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2023 Q1 United States 78200 48900 7
2 2023 Q4 United States 75600 46100 6.9
3 2022 Q1 United States 74800 46800 6.9
4 2019 Q1 United States 72450 45200 6.8
5 2022 Q4 United States 72100 44500 6.8
6 2023 Q2 United States 70100 41200 6.7
7 2019 Q4 United States 69800 42800 6.7
8 2020 Q1 United States 68100 40200 6.6
9 2022 Q2 United States 67500 39200 6.6
10 2019 Q2 United States 65200 38100 6.5
# ℹ 110 more rows
# ℹ 7 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
# satisfaction_score <dbl>, total_visitors <dbl>, spending_per_night <dbl>,
# stayover_pct <dbl>, region <chr>
R
# Sort by multiple columns: year first, then by stayover visitors descending
arrange(visitors, year, desc(visitors_stayover))
OUTPUT
# A tibble: 120 × 13
year quarter origin visitors_stayover visitors_cruise avg_stay_nights
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2019 Q1 United States 72450 45200 6.8
2 2019 Q4 United States 69800 42800 6.7
3 2019 Q2 United States 65200 38100 6.5
4 2019 Q3 United States 58900 32400 6.2
5 2019 Q3 Netherlands 21200 800 10.5
6 2019 Q4 Netherlands 19600 1100 10.4
7 2019 Q1 Netherlands 18300 1200 10.2
8 2019 Q2 Netherlands 15400 900 9.8
9 2019 Q1 Other 9800 5600 5.5
10 2019 Q4 Other 8900 5100 5.4
# ℹ 110 more rows
# ℹ 7 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
# satisfaction_score <dbl>, total_visitors <dbl>, spending_per_night <dbl>,
# stayover_pct <dbl>, region <chr>
group_by() + summarise() — Split File and
Aggregate
This is one of the most powerful combinations in dplyr, and it replaces two SPSS operations at once:
- Data > Split File (which tells SPSS to run analyses separately for each group)
- Data > Aggregate (which calculates summary statistics by group)
R
# Average stayover visitors per year
yearly_summary <- visitors |>
group_by(year) |>
summarise(
mean_stayover = mean(visitors_stayover),
total_stayover = sum(visitors_stayover)
)
yearly_summary
OUTPUT
# A tibble: 5 × 3
year mean_stayover total_stayover
<dbl> <dbl> <dbl>
1 2019 18360. 440650
2 2020 8687. 208480
3 2021 14883. 357200
4 2022 18488. 443700
5 2023 19321. 463700
Wait — what is that |> symbol? That is the
pipe operator, and it deserves its own section.
The pipe operator |>
The pipe |> is one of the most important ideas in
modern R. Read it as “and then”. It takes the result of
the expression on the left and passes it as the first argument to the
function on the right.
Without the pipe, you would write:
R
# Nested style (hard to read)
summarise(group_by(filter(visitors, origin == "United States"), year), mean_spend = mean(avg_spending_usd))
That is like reading a sentence from the inside out. With the pipe, the same code becomes:
R
visitors |>
filter(origin == "United States") |>
group_by(year) |>
summarise(mean_spend = mean(avg_spending_usd))
OUTPUT
# A tibble: 5 × 2
year mean_spend
<dbl> <dbl>
1 2019 1225
2 2020 1170
3 2021 1215
4 2022 1252.
5 2023 1275
Read this as: “Take visitors, and then
filter to US rows, and then group by year, and
then summarise mean spending.”
The pipe makes your code read from top to bottom, like a recipe. Each line is one step.
|> vs %>%
You may see %>% in older R code and tutorials. This
is the original pipe operator from the magrittr package.
The native pipe |> was added to base R in version 4.1
(2021) and works without loading any packages. They behave almost
identically. We use |> in this course because it
requires no extra dependencies.
Building up a pipeline step by step
A good workflow is to build your pipeline one step at a time, checking the result after each line. Let us work through an example:
R
# Step 1: Start with the data
visitors |>
filter(year >= 2022)
OUTPUT
# A tibble: 48 × 13
year quarter origin visitors_stayover visitors_cruise avg_stay_nights
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2022 Q1 United States 74800 46800 6.9
2 2022 Q1 Netherlands 19100 1300 10.3
3 2022 Q1 Venezuela 4800 200 4.4
4 2022 Q1 Colombia 5900 750 4.7
5 2022 Q1 Canada 4600 3300 7.2
6 2022 Q1 Other 10200 5800 5.6
7 2022 Q2 United States 67500 39200 6.6
8 2022 Q2 Netherlands 16200 1000 10
9 2022 Q2 Venezuela 4200 150 4.3
10 2022 Q2 Colombia 5600 650 4.6
# ℹ 38 more rows
# ℹ 7 more variables: avg_spending_usd <dbl>, hotel_occupancy_pct <dbl>,
# satisfaction_score <dbl>, total_visitors <dbl>, spending_per_night <dbl>,
# stayover_pct <dbl>, region <chr>
R
# Step 2: Add a column selection
visitors |>
filter(year >= 2022) |>
select(year, quarter, origin, visitors_stayover, avg_spending_usd)
OUTPUT
# A tibble: 48 × 5
year quarter origin visitors_stayover avg_spending_usd
<dbl> <chr> <chr> <dbl> <dbl>
1 2022 Q1 United States 74800 1280
2 2022 Q1 Netherlands 19100 1000
3 2022 Q1 Venezuela 4800 570
4 2022 Q1 Colombia 5900 700
5 2022 Q1 Canada 4600 1200
6 2022 Q1 Other 10200 900
7 2022 Q2 United States 67500 1250
8 2022 Q2 Netherlands 16200 980
9 2022 Q2 Venezuela 4200 550
10 2022 Q2 Colombia 5600 690
# ℹ 38 more rows
R
# Step 3: Group and summarise
visitors |>
filter(year >= 2022) |>
group_by(origin) |>
summarise(
mean_stayover = mean(visitors_stayover),
mean_spending = mean(avg_spending_usd)
) |>
arrange(desc(mean_stayover))
OUTPUT
# A tibble: 6 × 3
origin mean_stayover mean_spending
<chr> <dbl> <dbl>
1 United States 70375 1264.
2 Netherlands 20038. 1011.
3 Other 9275 888.
4 Colombia 5600 696.
5 Venezuela 4250 552.
6 Canada 3888. 1176.
This final pipeline reads: “Take visitors, keep only 2022 and later, group by origin, calculate mean stayover visitors and mean spending, then sort by mean stayover visitors in descending order.”
- Write the pipe
|>on the whiteboard and say “and then” out loud every time you use it. This mental model sticks. - Build pipelines live, one step at a time. Run after each added line so participants can see how the output changes.
- The most common beginner mistake is putting
|>at the start of a line instead of at the end of the previous line. Emphasize: the pipe goes at the end of the line, so R knows the expression continues. - Compare nested function calls to piped code side by side. The readability advantage sells itself.
Challenge 1: US visitors analysis
Using the visitors dataset and the pipe operator, write
a pipeline that:
- Filters to only United States visitors
- Creates a new column called
total_visitorsthat addsvisitors_stayoverandvisitors_cruise - Groups by
year - Calculates the total (sum) of
total_visitorsfor each year
Save the result to an object called us_yearly and print
it. Which year had the fewest total US visitors? Why might that be?
R
us_yearly <- visitors |>
filter(origin == "United States") |>
mutate(total_visitors = visitors_stayover + visitors_cruise) |>
group_by(year) |>
summarise(total = sum(total_visitors))
us_yearly
OUTPUT
# A tibble: 5 × 2
year total
<dbl> <dbl>
1 2019 424850
2 2020 181300
3 2021 315300
4 2022 440700
5 2023 459800
2020 had by far the fewest US visitors, due to the COVID-19 pandemic and associated travel restrictions. You can see the recovery in subsequent years.
Challenge 2: Top spending origins
Write a pipeline that finds the average avg_spending_usd
for each origin country across the entire dataset, then sorts the result
from highest to lowest. Which origin country has the highest average
spending?
R
visitors |>
group_by(origin) |>
summarise(mean_spending = mean(avg_spending_usd)) |>
arrange(desc(mean_spending))
OUTPUT
# A tibble: 6 × 2
origin mean_spending
<chr> <dbl>
1 United States 1228.
2 Canada 1139
3 Netherlands 978.
4 Other 850
5 Colombia 654
6 Venezuela 540
The United States has the highest average spending per visitor, followed by Canada. This makes sense given the typical length of stay and the types of tourism activities associated with visitors from these markets.
Challenge 3: Regional summary
Using the region column we created earlier with
case_when(), write a pipeline that calculates the total
visitors_stayover and mean satisfaction_score
for each region and each year. Sort by year and then by total stayover
visitors descending.
Hint: you will need to group_by() two columns.
R
visitors |>
group_by(region, year) |>
summarise(
total_stayover = sum(visitors_stayover),
mean_satisfaction = mean(satisfaction_score),
.groups = "drop"
) |>
arrange(year, desc(total_stayover))
OUTPUT
# A tibble: 20 × 4
region year total_stayover mean_satisfaction
<chr> <dbl> <dbl> <dbl>
1 North America 2019 280950 8
2 Europe 2019 74500 7.92
3 South America 2019 50500 7.45
4 Other 2019 34700 7.5
5 North America 2020 141180 7.56
6 Europe 2020 35500 7.48
7 Other 2020 16600 7.2
8 South America 2020 15200 6.98
9 North America 2021 233500 7.88
10 Europe 2021 65600 7.82
11 South America 2021 29800 7.15
12 Other 2021 28300 7.4
13 North America 2022 290800 8.1
14 Europe 2022 78100 8.02
15 South America 2022 38500 7.34
16 Other 2022 36300 7.6
17 North America 2023 303300 8.19
18 Europe 2023 82200 8.12
19 South America 2023 40300 7.44
20 Other 2023 37900 7.68
The .groups = "drop" argument tells
summarise() to remove the grouping after calculation.
Without it, the result would still be grouped by region,
which can cause unexpected behaviour in later steps.
Summary
You now know the six core dplyr verbs and can map each one to its SPSS equivalent:
| You used to… | Now you write… |
|---|---|
| Data > Select Cases | filter() |
| Select columns in Variable View | select() |
| Transform > Compute Variable | mutate() |
| Transform > Recode |
mutate() + case_when()
|
| Data > Sort Cases | arrange() |
| Data > Split File + Aggregate |
group_by() + summarise()
|
And you connect them all with |> — “and then” — to
build readable, reproducible data pipelines.
- dplyr verbs (
filter,select,mutate,arrange,summarise) replace SPSS menu operations - The pipe operator
|>chains operations together, making code readable -
group_by()combined withsummarise()replaces SPSS Split File + Aggregate