This case study examines user behavior in Cyclistic’s bike-share program by analyzing 12 months of historical trip data (June 2022 through May 2023). By comparing casual riders with annual members, we aim to derive insights that inform targeted marketing strategies. The ultimate goal is to convert casual riders into annual members and support Cyclistic’s growth.
# Load required packages
library(tidyverse)
library(lubridate)
# Define file paths for the datasets for each month
file_paths <- c("202305-divvy-tripdata/202305-divvy-tripdata.csv",
"202304-divvy-tripdata/202304-divvy-tripdata.csv",
"202303-divvy-tripdata/202303-divvy-tripdata.csv",
"202302-divvy-tripdata/202302-divvy-tripdata.csv",
"202301-divvy-tripdata/202301-divvy-tripdata.csv",
"202212-divvy-tripdata/202212-divvy-tripdata.csv",
"202211-divvy-tripdata/202211-divvy-tripdata.csv",
"202210-divvy-tripdata/202210-divvy-tripdata.csv",
"202209-divvy-tripdata/202209-divvy-publictripdata.csv",
"202208-divvy-tripdata/202208-divvy-tripdata.csv",
"202207-divvy-tripdata/202207-divvy-tripdata.csv",
"202206-divvy-tripdata/202206-divvy-tripdata.csv")
# Initialize an empty list to store the data frames for each month
data_list <- list()
# Loop through the file paths and read each monthly CSV into the list
for (i in seq_along(file_paths)) {
data_list[[i]] <- read_csv(file_paths[i])
}
# Combine all monthly data frames into one master data frame "divvy"
divvy <- bind_rows(data_list)
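As an aside, the loop and bind_rows() call can be collapsed into a single step; the sketch below (using purrr, which loads with the tidyverse) produces the same divvy data frame.
# Equivalent shortcut: read every CSV and row-bind the results in one call
divvy <- purrr::map_dfr(file_paths, read_csv)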
We begin by inspecting the dataset to understand its structure, identify the types of data it contains, and check for any missing values or potential inconsistencies.
First, we look at the first few rows of the dataset.
head(divvy)
## # A tibble: 6 × 13
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 0D9FA920C3062031 electric_bike 2023-05-07 19:53:48 2023-05-07 19:58:32
## 2 92485E5FB5888ACD electric_bike 2023-05-06 18:54:08 2023-05-06 19:03:35
## 3 FB144B3FC8300187 electric_bike 2023-05-21 00:40:21 2023-05-21 00:44:36
## 4 DDEB93BC2CE9AA77 classic_bike 2023-05-10 16:47:01 2023-05-10 16:59:52
## 5 C07B70172FC92F59 classic_bike 2023-05-09 18:30:34 2023-05-09 18:39:28
## 6 2BA66385DF8F815A classic_bike 2023-05-30 15:01:21 2023-05-30 15:17:00
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
To get a better understanding of the variables in our dataset, we can use the summary() function, which provides descriptive statistics for each column.
summary(divvy)
## ride_id rideable_type started_at
## Length:5829030 Length:5829030 Min. :2022-06-01 00:00:04.00
## Class :character Class :character 1st Qu.:2022-07-27 14:54:51.00
## Mode :character Mode :character Median :2022-09-23 07:56:12.50
## Mean :2022-10-28 01:02:13.02
## 3rd Qu.:2023-02-05 15:35:06.75
## Max. :2023-05-31 23:59:58.00
##
## ended_at start_station_name start_station_id
## Min. :2022-06-01 00:02:38.00 Length:5829030 Length:5829030
## 1st Qu.:2022-07-27 15:16:15.75 Class :character Class :character
## Median :2022-09-23 08:14:32.50 Mode :character Mode :character
## Mean :2022-10-28 01:20:55.64
## 3rd Qu.:2023-02-05 15:50:20.25
## Max. :2023-06-07 23:04:26.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5829030 Length:5829030 Min. :41.64 Min. :-87.87
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
##
## end_lat end_lng member_casual
## Min. : 0.00 Min. :-88.11 Length:5829030
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. : 0.00
## NA's :5961 NA's :5961
Next, we take a look at the names of the variables in our dataset.
colnames(divvy)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
Understanding the types of data in our dataset is crucial as it dictates the kind of analyses we can perform on the variables.
str(divvy)
## spc_tbl_ [5,829,030 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5829030] "0D9FA920C3062031" "92485E5FB5888ACD" "FB144B3FC8300187" "DDEB93BC2CE9AA77" ...
## $ rideable_type : chr [1:5829030] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5829030], format: "2023-05-07 19:53:48" "2023-05-06 18:54:08" ...
## $ ended_at : POSIXct[1:5829030], format: "2023-05-07 19:58:32" "2023-05-06 19:03:35" ...
## $ start_station_name: chr [1:5829030] "Southport Ave & Belmont Ave" "Southport Ave & Belmont Ave" "Halsted St & 21st St" "Carpenter St & Huron St" ...
## $ start_station_id : chr [1:5829030] "13229" "13229" "13162" "13196" ...
## $ end_station_name : chr [1:5829030] NA NA NA "Damen Ave & Cortland St" ...
## $ end_station_id : chr [1:5829030] NA NA NA "13133" ...
## $ start_lat : num [1:5829030] 41.9 41.9 41.9 41.9 42 ...
## $ start_lng : num [1:5829030] -87.7 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:5829030] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5829030] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:5829030] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
By inspecting the data, we can build a solid foundation for our subsequent analyses.
Our data cleaning process involves converting the timestamp columns to datetime format and handling missing values in the dataset.
To support time-based analysis and trip duration calculations, we ensure the started_at and ended_at columns are in datetime format.
Before the transformation, we check how many entries cannot be converted to datetime format:
# Checking for potential parsing issues
problem_entries <- sum(is.na(lubridate::parse_date_time(divvy$started_at, orders = "ymd HMS")))
## Warning: 29 failed to parse.
## Warning: 17 failed to parse.
problem_entries
## [1] 29
This check returns the number of entries in started_at that fail to parse into datetime format. We have 29 such entries, which we expect to be treated as NA during the transformation.
# Transforming 'started_at' and 'ended_at' into datetime format
divvy <- divvy %>% mutate(
started_at = ymd_hms(started_at),
ended_at = ymd_hms(ended_at)
)
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `started_at = ymd_hms(started_at)`.
## Caused by warning:
## ! 29 failed to parse.
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
During the datetime conversion, we received a warning that 29 entries in the started_at column failed to parse. This is expected due to known irregularities in the raw data; these entries are treated as NA and accounted for in our downstream analyses.
Missing data can lead to biased or incorrect results. Therefore, it’s important to identify any missing values in our dataset.
# Total missing values
total_na <- sum(is.na(divvy))
paste("Total missing values in the dataset: ", total_na)
## [1] "Total missing values in the dataset: 3464851"
# Missing values per column
col_na <- colSums(is.na(divvy))
col_na_list <- paste(names(col_na), ":", col_na)
print(col_na_list)
## [1] "ride_id : 0" "rideable_type : 0"
## [3] "started_at : 29" "ended_at : 23"
## [5] "start_station_name : 834545" "start_station_id : 834677"
## [7] "end_station_name : 891757" "end_station_id : 891898"
## [9] "start_lat : 0" "start_lng : 0"
## [11] "end_lat : 5961" "end_lng : 5961"
## [13] "member_casual : 0"
# Calculate proportion of missing data
missing_prop <- total_na / prod(dim(divvy))
paste("Proportion of missing data in the dataset: ", round(missing_prop * 100, 2), "%")
## [1] "Proportion of missing data in the dataset: 4.57 %"
We detected a total of 3,464,851 missing data entries. In the next step, we drop the rows containing missing values.
# Summary statistics before dropping missing data
summary_before <- summary(divvy)
# Drop missing data
divvy <- divvy %>% drop_na()
# Summary statistics after dropping missing data
summary_after <- summary(divvy)
# Display summaries
list(before = summary_before, after = summary_after)
## $before
## ride_id rideable_type started_at
## Length:5829030 Length:5829030 Min. :2022-06-01 00:00:04.00
## Class :character Class :character 1st Qu.:2022-07-27 14:54:41.00
## Mode :character Mode :character Median :2022-09-23 07:56:13.00
## Mean :2022-10-28 01:02:14.65
## 3rd Qu.:2023-02-05 15:35:12.00
## Max. :2023-05-31 23:59:58.00
## NA's :29
## ended_at start_station_name start_station_id
## Min. :2022-06-01 00:02:38.00 Length:5829030 Length:5829030
## 1st Qu.:2022-07-27 15:16:22.00 Class :character Class :character
## Median :2022-09-23 08:14:46.00 Mode :character Mode :character
## Mean :2022-10-28 01:21:12.41
## 3rd Qu.:2023-02-05 15:50:41.50
## Max. :2023-06-07 23:04:26.00
## NA's :23
## end_station_name end_station_id start_lat start_lng
## Length:5829030 Length:5829030 Min. :41.64 Min. :-87.87
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
##
## end_lat end_lng member_casual
## Min. : 0.00 Min. :-88.11 Length:5829030
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. : 0.00
## NA's :5961 NA's :5961
##
## $after
## ride_id rideable_type started_at
## Length:4494648 Length:4494648 Min. :2022-06-01 00:00:04.00
## Class :character Class :character 1st Qu.:2022-07-25 17:41:32.00
## Mode :character Mode :character Median :2022-09-21 14:54:26.50
## Mean :2022-10-26 20:48:30.31
## 3rd Qu.:2023-02-04 15:56:52.25
## Max. :2023-05-31 23:59:49.00
## ended_at start_station_name start_station_id
## Min. :2022-06-01 00:02:38.00 Length:4494648 Length:4494648
## 1st Qu.:2022-07-25 17:57:29.75 Class :character Class :character
## Median :2022-09-21 15:09:47.00 Mode :character Mode :character
## Mean :2022-10-26 21:04:47.50
## 3rd Qu.:2023-02-04 16:09:23.00
## Max. :2023-06-07 23:04:26.00
## end_station_name end_station_id start_lat start_lng
## Length:4494648 Length:4494648 Min. :41.65 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.06 Max. :-87.53
## end_lat end_lng member_casual
## Min. : 0.00 Min. :-87.84 Length:4494648
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.06 Max. : 0.00
After dropping these rows, the dataset shrinks from 5,829,030 to 4,494,648 rides, and no missing values remain.
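As an optional sanity check (a quick sketch using the cleaned divvy data frame), this can be confirmed programmatically:
# Confirm that no missing values remain; this should return 0 after drop_na()
sum(is.na(divvy))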
Next, we create new columns to support our forthcoming analysis: trip_duration (in minutes), day_of_week, and hour_of_day.
# Creating 'trip_duration' column
divvy <- divvy %>% mutate(
trip_duration = as.numeric(difftime(ended_at, started_at, units="mins")) # nolint
)
# Creating 'day_of_week' and 'hour_of_day' columns
divvy <- divvy %>% mutate(
day_of_week = wday(started_at, label = TRUE),
hour_of_day = hour(started_at)
)
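As an optional check on the new trip_duration column (a sketch, not required for the steps that follow), we can count rides whose recorded end time is at or before the start time, since such rows produce non-positive durations:
# Count rides with non-positive trip durations (illustrative check)
divvy %>%
  summarise(non_positive_durations = sum(trip_duration <= 0, na.rm = TRUE))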
To analyze trends and patterns over time, we create additional time variables: semi_annual, quarterly, and monthly. These variables allow us to observe trends within different timeframes.
# Creating new time variables
divvy <- divvy %>%
mutate(semi_annual = case_when(
month(started_at) %in% 1:6 ~ paste(year(started_at), "H1"),
month(started_at) %in% 7:12 ~ paste(year(started_at), "H2")
),
quarterly = paste(year(started_at), "Q", quarter(started_at)),
monthly = paste(year(started_at), month(started_at, label = TRUE))
)
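To verify that the period labels were built as intended, a quick tabulation (sketched below) shows the distinct semi_annual, quarterly, and monthly values and the number of rides in each:
# Tabulate ride counts per period label (illustrative check)
divvy %>%
  count(semi_annual, quarterly, monthly)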
With these transformations, our dataset is now prepared for exploratory data analysis.
After preprocessing the data, we can now move on to the analysis. We are interested in comparing the behavior of casual riders and annual members: the frequency of rides, preferred bike type, average trip duration, and common routes for each user group.
We start by comparing the frequency of rides between casual riders and annual members.
# Calculate ride count for each rider type
ride_counts <- divvy %>%
group_by(semi_annual, quarterly, monthly, member_casual) %>%
summarise(ride_count = n(), .groups = "drop")
# Preview the data frame
print(ride_counts, n = 50)
## # A tibble: 24 × 5
## semi_annual quarterly monthly member_casual ride_count
## <chr> <chr> <chr> <chr> <int>
## 1 2022 H1 2022 Q 2 2022 Jun casual 292064
## 2 2022 H1 2022 Q 2 2022 Jun member 328279
## 3 2022 H2 2022 Q 3 2022 Aug casual 270093
## 4 2022 H2 2022 Q 3 2022 Aug member 335228
## 5 2022 H2 2022 Q 3 2022 Jul casual 311675
## 6 2022 H2 2022 Q 3 2022 Jul member 330999
## 7 2022 H2 2022 Q 3 2022 Sep casual 220912
## 8 2022 H2 2022 Q 3 2022 Sep member 314228
## 9 2022 H2 2022 Q 4 2022 Dec casual 31505
## 10 2022 H2 2022 Q 4 2022 Dec member 103895
## 11 2022 H2 2022 Q 4 2022 Nov casual 73555
## 12 2022 H2 2022 Q 4 2022 Nov member 182237
## 13 2022 H2 2022 Q 4 2022 Oct casual 151324
## 14 2022 H2 2022 Q 4 2022 Oct member 262944
## 15 2023 H1 2023 Q 1 2023 Feb casual 32776
## 16 2023 H1 2023 Q 1 2023 Feb member 116783
## 17 2023 H1 2023 Q 1 2023 Jan casual 29621
## 18 2023 H1 2023 Q 1 2023 Jan member 118663
## 19 2023 H1 2023 Q 1 2023 Mar casual 46792
## 20 2023 H1 2023 Q 1 2023 Mar member 153655
## 21 2023 H1 2023 Q 2 2023 Apr casual 110537
## 22 2023 H1 2023 Q 2 2023 Apr member 213659
## 23 2023 H1 2023 Q 2 2023 May casual 177039
## 24 2023 H1 2023 Q 2 2023 May member 286185
# Visualize the result in a bar chart
ggplot(ride_counts, aes(x = member_casual, y = ride_count, fill = member_casual)) +
geom_bar(stat = "identity") +
labs(title = "Ride Frequency by User Type", x = "User Type", y = "Ride Count") +
theme_minimal()
# Visualize the result in a bar chart - Semi-annually
ggplot(ride_counts, aes(x = member_casual, y = ride_count, fill = member_casual)) +
geom_bar(stat = "identity") +
facet_wrap(~semi_annual) +
labs(title = "Semi-Annual Ride Frequency by User Type", x = "User Type", y = "Ride Count") +
theme_minimal()
# Visualize the result in a bar chart - Quarterly
ggplot(ride_counts, aes(x = member_casual, y = ride_count, fill = member_casual)) +
geom_bar(stat = "identity") +
facet_wrap(~quarterly) +
labs(title = "Quarterly Ride Frequency by User Type", x = "User Type", y = "Ride Count") +
theme_minimal()
# Visualize the result in a bar chart - Monthly
ggplot(ride_counts, aes(x = member_casual, y = ride_count, fill = member_casual)) +
geom_bar(stat = "identity") +
facet_wrap(~monthly) +
labs(title = "Monthly Ride Frequency by User Type", x = "User Type", y = "Ride Count") +
theme_minimal()
Next, let’s analyze which bike types are preferred by casual riders and annual members.
# Determine preferred bike type for each user type
bike_pref <- divvy %>%
group_by(semi_annual, quarterly, monthly, member_casual, rideable_type) %>%
summarise(bike_count = n(), .groups = "drop")
# Preview the data frame
print(bike_pref, n = 50)
## # A tibble: 60 × 6
## semi_annual quarterly monthly member_casual rideable_type bike_count
## <chr> <chr> <chr> <chr> <chr> <int>
## 1 2022 H1 2022 Q 2 2022 Jun casual classic_bike 169446
## 2 2022 H1 2022 Q 2 2022 Jun casual docked_bike 30210
## 3 2022 H1 2022 Q 2 2022 Jun casual electric_bike 92408
## 4 2022 H1 2022 Q 2 2022 Jun member classic_bike 236560
## 5 2022 H1 2022 Q 2 2022 Jun member electric_bike 91719
## 6 2022 H2 2022 Q 3 2022 Aug casual classic_bike 128307
## 7 2022 H2 2022 Q 3 2022 Aug casual docked_bike 25888
## 8 2022 H2 2022 Q 3 2022 Aug casual electric_bike 115898
## 9 2022 H2 2022 Q 3 2022 Aug member classic_bike 215326
## 10 2022 H2 2022 Q 3 2022 Aug member electric_bike 119902
## 11 2022 H2 2022 Q 3 2022 Jul casual classic_bike 155676
## 12 2022 H2 2022 Q 3 2022 Jul casual docked_bike 30599
## 13 2022 H2 2022 Q 3 2022 Jul casual electric_bike 125400
## 14 2022 H2 2022 Q 3 2022 Jul member classic_bike 216997
## 15 2022 H2 2022 Q 3 2022 Jul member electric_bike 114002
## 16 2022 H2 2022 Q 3 2022 Sep casual classic_bike 105078
## 17 2022 H2 2022 Q 3 2022 Sep casual docked_bike 19488
## 18 2022 H2 2022 Q 3 2022 Sep casual electric_bike 96346
## 19 2022 H2 2022 Q 3 2022 Sep member classic_bike 200668
## 20 2022 H2 2022 Q 3 2022 Sep member electric_bike 113560
## 21 2022 H2 2022 Q 4 2022 Dec casual classic_bike 12600
## 22 2022 H2 2022 Q 4 2022 Dec casual docked_bike 1872
## 23 2022 H2 2022 Q 4 2022 Dec casual electric_bike 17033
## 24 2022 H2 2022 Q 4 2022 Dec member classic_bike 60670
## 25 2022 H2 2022 Q 4 2022 Dec member electric_bike 43225
## 26 2022 H2 2022 Q 4 2022 Nov casual classic_bike 32953
## 27 2022 H2 2022 Q 4 2022 Nov casual docked_bike 5794
## 28 2022 H2 2022 Q 4 2022 Nov casual electric_bike 34808
## 29 2022 H2 2022 Q 4 2022 Nov member classic_bike 111503
## 30 2022 H2 2022 Q 4 2022 Nov member electric_bike 70734
## 31 2022 H2 2022 Q 4 2022 Oct casual classic_bike 61403
## 32 2022 H2 2022 Q 4 2022 Oct casual docked_bike 12396
## 33 2022 H2 2022 Q 4 2022 Oct casual electric_bike 77525
## 34 2022 H2 2022 Q 4 2022 Oct member classic_bike 151888
## 35 2022 H2 2022 Q 4 2022 Oct member electric_bike 111056
## 36 2023 H1 2023 Q 1 2023 Feb casual classic_bike 15446
## 37 2023 H1 2023 Q 1 2023 Feb casual docked_bike 2151
## 38 2023 H1 2023 Q 1 2023 Feb casual electric_bike 15179
## 39 2023 H1 2023 Q 1 2023 Feb member classic_bike 74223
## 40 2023 H1 2023 Q 1 2023 Feb member electric_bike 42560
## 41 2023 H1 2023 Q 1 2023 Jan casual classic_bike 13860
## 42 2023 H1 2023 Q 1 2023 Jan casual docked_bike 1682
## 43 2023 H1 2023 Q 1 2023 Jan casual electric_bike 14079
## 44 2023 H1 2023 Q 1 2023 Jan member classic_bike 76359
## 45 2023 H1 2023 Q 1 2023 Jan member electric_bike 42304
## 46 2023 H1 2023 Q 1 2023 Mar casual classic_bike 19380
## 47 2023 H1 2023 Q 1 2023 Mar casual docked_bike 2944
## 48 2023 H1 2023 Q 1 2023 Mar casual electric_bike 24468
## 49 2023 H1 2023 Q 1 2023 Mar member classic_bike 87596
## 50 2023 H1 2023 Q 1 2023 Mar member electric_bike 66059
## # ℹ 10 more rows
# Visualize the result in a bar chart
ggplot(bike_pref, aes(x = member_casual, y = bike_count, fill = rideable_type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Bike Preference by User Type", x = "User Type", y = "Bike Count") +
theme_minimal()
# Visualize the result in a bar chart - Semi-annually
bike_pref %>%
ggplot(aes(x = member_casual, y = bike_count, fill = rideable_type)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~semi_annual) +
labs(title = "Semi-annual Bike Preference by User Type", x = "User Type", y = "Bike Count") +
theme_minimal()
# Visualize the result in a bar chart - Quarterly
bike_pref %>%
ggplot(aes(x = member_casual, y = bike_count, fill = rideable_type)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~quarterly) +
labs(title = "Quarterly Bike Preference by User Type", x = "User Type", y = "Bike Count") +
theme_minimal()
# Visualize the result in a bar chart - Monthly
bike_pref %>%
ggplot(aes(x = member_casual, y = bike_count, fill = rideable_type)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~monthly) +
labs(title = "Monthly Bike Preference by User Type", x = "User Type", y = "Bike Count") +
theme_minimal()
Now, let’s find out if there is any difference in average trip duration between casual riders and annual members.
# Calculate average trip duration for each user type
avg_duration <- divvy %>%
group_by(semi_annual, quarterly, monthly, member_casual) %>%
summarise(avg_trip_duration = mean(trip_duration, na.rm = TRUE), .groups = "drop")
# Preview the data frame
print(avg_duration, n = 50)
## # A tibble: 24 × 5
## semi_annual quarterly monthly member_casual avg_trip_duration
## <chr> <chr> <chr> <chr> <dbl>
## 1 2022 H1 2022 Q 2 2022 Jun casual 25.0
## 2 2022 H1 2022 Q 2 2022 Jun member 13.7
## 3 2022 H2 2022 Q 3 2022 Aug casual 23.3
## 4 2022 H2 2022 Q 3 2022 Aug member 13.1
## 5 2022 H2 2022 Q 3 2022 Jul casual 25.1
## 6 2022 H2 2022 Q 3 2022 Jul member 13.5
## 7 2022 H2 2022 Q 3 2022 Sep casual 21.8
## 8 2022 H2 2022 Q 3 2022 Sep member 12.6
## 9 2022 H2 2022 Q 4 2022 Dec casual 14.8
## 10 2022 H2 2022 Q 4 2022 Dec member 10.2
## 11 2022 H2 2022 Q 4 2022 Nov casual 17.2
## 12 2022 H2 2022 Q 4 2022 Nov member 10.8
## 13 2022 H2 2022 Q 4 2022 Oct casual 20.5
## 14 2022 H2 2022 Q 4 2022 Oct member 11.7
## 15 2023 H1 2023 Q 1 2023 Feb casual 17.7
## 16 2023 H1 2023 Q 1 2023 Feb member 10.4
## 17 2023 H1 2023 Q 1 2023 Jan casual 14.9
## 18 2023 H1 2023 Q 1 2023 Jan member 10.0
## 19 2023 H1 2023 Q 1 2023 Mar casual 16.7
## 20 2023 H1 2023 Q 1 2023 Mar member 10.2
## 21 2023 H1 2023 Q 2 2023 Apr casual 22.6
## 22 2023 H1 2023 Q 2 2023 Apr member 11.6
## 23 2023 H1 2023 Q 2 2023 May casual 24.5
## 24 2023 H1 2023 Q 2 2023 May member 12.7
# Visualize the result in a bar chart
ggplot(avg_duration, aes(x = member_casual, y = avg_trip_duration, fill = member_casual)) +
geom_bar(stat = "identity") +
labs(title = "Average Trip Duration by User Type", x = "User Type", y = "Avg Trip Duration (mins)") +
theme_minimal()
# Visualize the result in a bar chart - Semi-annually
avg_duration %>%
ggplot(aes(x = member_casual, y = avg_trip_duration, fill = member_casual)) +
geom_bar(stat = "identity") +
facet_wrap(~semi_annual) +
labs(title = "Semi-annual Average Trip Duration by User Type", x = "User Type", y = "Avg Trip Duration (mins)") +
theme_minimal()
# Visualize the result in a bar chart - Quarterly
avg_duration %>%
ggplot(aes(x = member_casual, y = avg_trip_duration, fill = member_casual)) +
geom_bar(stat = "identity") +
facet_wrap(~quarterly) +
labs(title = "Quarterly Average Trip Duration by User Type", x = "User Type", y = "Avg Trip Duration (mins)") +
theme_minimal()
# Visualize the result in a bar chart - Monthly
avg_duration %>%
ggplot(aes(x = member_casual, y = avg_trip_duration, fill = member_casual)) +
geom_bar(stat = "identity") +
facet_wrap(~monthly) +
labs(title = "Monthly Average Trip Duration by User Type", x = "User Type", y = "Avg Trip Duration (mins)") +
theme_minimal()
Lastly, we can explore the common routes for each user group.
# Identify top 5 common routes for each user type
top_routes <- divvy %>%
mutate(route = paste(start_station_name, end_station_name, sep = " to ")) %>%
group_by(member_casual, route) %>%
summarise(route_count = n(), .groups = "drop") %>%
arrange(member_casual, desc(route_count)) %>%
group_by(member_casual) %>%
slice_head(n = 5)
# Preview the data frame
print(top_routes)
## # A tibble: 10 × 3
## # Groups: member_casual [2]
## member_casual route route_count
## <chr> <chr> <int>
## 1 casual Streeter Dr & Grand Ave to Streeter Dr & Grand Ave 10092
## 2 casual DuSable Lake Shore Dr & Monroe St to DuSable Lake … 6646
## 3 casual DuSable Lake Shore Dr & Monroe St to Streeter Dr &… 5101
## 4 casual Michigan Ave & Oak St to Michigan Ave & Oak St 4479
## 5 casual Millennium Park to Millennium Park 4028
## 6 member Ellis Ave & 60th St to University Ave & 57th St 6245
## 7 member University Ave & 57th St to Ellis Ave & 60th St 5894
## 8 member Ellis Ave & 60th St to Ellis Ave & 55th St 5719
## 9 member Ellis Ave & 55th St to Ellis Ave & 60th St 5394
## 10 member State St & 33rd St to Calumet Ave & 33rd St 4116
# Visualize the result in a bar chart
top_routes %>%
ggplot(aes(x = reorder(route, route_count), y = route_count, fill = member_casual)) +
geom_bar(stat = "identity") +
labs(title = "Top 5 Routes by User Type", x = "Route", y = "Route Count") +
coord_flip() +
theme_minimal()
These results will provide insights into the motivations and preferences of casual riders, and help design targeted marketing strategies to convert them into annual members.
Now, let’s compute some additional statistics that might be of interest.
# Calculate mean trip duration
mean_trip_duration <- mean(divvy$trip_duration, na.rm = TRUE)
# Identify max trip duration
max_trip_duration <- max(divvy$trip_duration, na.rm = TRUE)
# Find most frequent day of the week
day_of_week_mode <- divvy$day_of_week %>%
table() %>%
which.max() %>%
names()
# Report mean and maximum trip durations
cat("Mean trip duration: ", mean_trip_duration, " minutes\n")
## Mean trip duration: 16.28644 minutes
cat("Max trip duration: ", max_trip_duration, " minutes\n")
## Max trip duration: 32035.45 minutes
# Report the most frequent day of the week
cat("Most frequent day of the week: ", day_of_week_mode, "\n")
## Most frequent day of the week: Sat
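The maximum of roughly 32,035 minutes (about 22 days) points to a long right tail in trip durations. If desired, the spread can be inspected with a quick quantile summary (a sketch; the cut points are arbitrary):
# Examine the distribution of trip durations to put the maximum in context
quantile(divvy$trip_duration, probs = c(0.5, 0.9, 0.99, 0.999, 1), na.rm = TRUE)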
# Calculate average trip duration and number of rides for each user type by day_of_week
avg_trip_duration_day <- divvy %>%
group_by(member_casual, day_of_week) %>%
summarise(avg_trip_duration = mean(trip_duration, na.rm = TRUE), .groups = "drop")
print(avg_trip_duration_day, n = 50)
## # A tibble: 14 × 3
## member_casual day_of_week avg_trip_duration
## <chr> <ord> <dbl>
## 1 casual Sun 26.2
## 2 casual Mon 22.8
## 3 casual Tue 20.4
## 4 casual Wed 19.6
## 5 casual Thu 20.2
## 6 casual Fri 21.8
## 7 casual Sat 25.4
## 8 member Sun 13.6
## 9 member Mon 11.6
## 10 member Tue 11.7
## 11 member Wed 11.7
## 12 member Thu 11.8
## 13 member Fri 12.0
## 14 member Sat 13.7
rides_day <- divvy %>%
group_by(member_casual, day_of_week) %>%
summarise(ride_count = n(), .groups = "drop")
print(rides_day, n = 50)
## # A tibble: 14 × 3
## member_casual day_of_week ride_count
## <chr> <ord> <int>
## 1 casual Sun 289569
## 2 casual Mon 194972
## 3 casual Tue 202336
## 4 casual Wed 218110
## 5 casual Thu 233764
## 6 casual Fri 258174
## 7 casual Sat 350968
## 8 member Sun 303835
## 9 member Mon 374373
## 10 member Tue 438808
## 11 member Wed 456845
## 12 member Thu 442117
## 13 member Fri 387059
## 14 member Sat 343718
# Visualize average trip duration by day of the week for each rider type
ggplot(avg_trip_duration_day, aes(x = day_of_week, y = avg_trip_duration, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Trip Duration by Day of the Week", x = "Day of the Week", y = "Average Trip Duration (minutes)") +
theme_minimal()
# Visualize ride counts by day of the week for each rider type
ggplot(rides_day, aes(x = day_of_week, y = ride_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Ride Counts by Day of the Week", x = "Day of the Week", y = "Ride Counts") +
theme_minimal()
# Calculate average trip duration and number of rides for each user type by hour_of_day
avg_trip_duration_hour <- divvy %>%
group_by(member_casual, hour_of_day) %>%
summarise(avg_trip_duration = mean(trip_duration, na.rm = TRUE), .groups = "drop")
print(avg_trip_duration_hour, n = 50)
## # A tibble: 48 × 3
## member_casual hour_of_day avg_trip_duration
## <chr> <int> <dbl>
## 1 casual 0 20.3
## 2 casual 1 21.7
## 3 casual 2 20.9
## 4 casual 3 19.8
## 5 casual 4 18.0
## 6 casual 5 15.4
## 7 casual 6 14.9
## 8 casual 7 14.6
## 9 casual 8 16.7
## 10 casual 9 22.5
## 11 casual 10 26.7
## 12 casual 11 27.3
## 13 casual 12 26.8
## 14 casual 13 26.7
## 15 casual 14 26.6
## 16 casual 15 24.9
## 17 casual 16 22.8
## 18 casual 17 21.3
## 19 casual 18 21.5
## 20 casual 19 21.1
## 21 casual 20 21.0
## 22 casual 21 19.7
## 23 casual 22 19.3
## 24 casual 23 19.9
## 25 member 0 11.8
## 26 member 1 11.9
## 27 member 2 12.2
## 28 member 3 12.1
## 29 member 4 12.4
## 30 member 5 10.2
## 31 member 6 10.5
## 32 member 7 11.1
## 33 member 8 11.0
## 34 member 9 11.2
## 35 member 10 12.0
## 36 member 11 12.2
## 37 member 12 12.1
## 38 member 13 12.2
## 39 member 14 12.6
## 40 member 15 12.5
## 41 member 16 12.7
## 42 member 17 13.0
## 43 member 18 12.9
## 44 member 19 12.7
## 45 member 20 12.5
## 46 member 21 12.1
## 47 member 22 12.2
## 48 member 23 12.2
rides_hour <- divvy %>%
group_by(member_casual, hour_of_day) %>%
summarise(ride_count = n(), .groups = "drop")
print(rides_hour, n = 50)
## # A tibble: 48 × 3
## member_casual hour_of_day ride_count
## <chr> <int> <int>
## 1 casual 0 32048
## 2 casual 1 20833
## 3 casual 2 12246
## 4 casual 3 6763
## 5 casual 4 4515
## 6 casual 5 8707
## 7 casual 6 23392
## 8 casual 7 40351
## 9 casual 8 55069
## 10 casual 9 56194
## 11 casual 10 72237
## 12 casual 11 93447
## 13 casual 12 109910
## 14 casual 13 114537
## 15 casual 14 121201
## 16 casual 15 134849
## 17 casual 16 153625
## 18 casual 17 170546
## 19 casual 18 148467
## 20 casual 19 111926
## 21 casual 20 81398
## 22 casual 21 69352
## 23 casual 22 61477
## 24 casual 23 44803
## 25 member 0 25576
## 26 member 1 15656
## 27 member 2 8737
## 28 member 3 5268
## 29 member 4 6097
## 30 member 5 26050
## 31 member 6 81663
## 32 member 7 151674
## 33 member 8 180596
## 34 member 9 120043
## 35 member 10 109540
## 36 member 11 129722
## 37 member 12 149413
## 38 member 13 147817
## 39 member 14 148811
## 40 member 15 183336
## 41 member 16 249057
## 42 member 17 297750
## 43 member 18 233324
## 44 member 19 165950
## 45 member 20 115173
## 46 member 21 88671
## 47 member 22 65509
## 48 member 23 41322
# Visualize average trip duration by hour of the day for each rider type
ggplot(avg_trip_duration_hour, aes(x = hour_of_day, y = avg_trip_duration, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Trip Duration by Hour of the Day", x = "Hour of the Day", y = "Average Trip Duration (minutes)") +
theme_minimal()
# Visualize ride counts by hour of the day for each rider type
ggplot(rides_hour, aes(x = hour_of_day, y = ride_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Ride Counts by Hour of the Day", x = "Hour of the Day", y = "Ride Counts") +
theme_minimal()
Based on the data analysis and the ultimate goal of converting casual riders into annual members, we conclude that the two user types show distinct usage patterns. Casual riders tend to ride longer, use both classic and electric bikes, and ride more heavily on weekends, often on routes associated with recreational activities. Members ride more frequently and for shorter durations, favor classic bikes, and use the service consistently throughout the week, indicating daily commuting patterns.
These differences provide an opportunity to tailor marketing strategies to convert casual riders into annual members.
Introduce Cost-Effective Membership Plans: Casual riders tend to take longer rides, so plans that are more cost-effective for long trips may appeal to them. Marketing that highlights the financial benefit of membership for longer rides, possibly bundled with a trial period, could reduce the perceived risk for this group.
Launch Weekend Promotions for Members: Members use the service more consistently throughout the week, but both casual riders and members ride longer on weekends, suggesting an interest in leisurely or recreational rides. Weekend promotions or member-exclusive weekend events could give casual riders a concrete reason to convert.
Enhance Bike Availability: Classic bikes are popular with both user groups, so ensuring their availability could improve the user experience. This reliability could be promoted to casual users as a further benefit of the service.
Implementing these recommendations would require a balanced approach, ensuring existing services for members are not disrupted. Monitoring the impact of these changes and taking user feedback into consideration will be crucial to fine-tune these strategies moving forward.