tidier package provides ‘Apache Spark’ style window aggregation for R dataframes and remote dbplyr tbls via ‘mutate’ in ‘dplyr’ flavour.
Example
Create a new column with average temp over last seven days in the same month.
set.seed(101)
airquality |>
# create date column
dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) |>
# create gaps by removing some days
dplyr::slice_sample(prop = 0.8) |>
# compute mean temperature over last seven days in the same month
tidier::mutate(avg_temp_over_last_week = mean(Temp, na.rm = TRUE),
.order_by = Day,
.by = Month,
.frame = c(lubridate::days(7), # 7 days before current row
lubridate::days(-1) # do not include current row
),
.index = date_col
)
#> # A tibble: 122 × 8
#> Month Ozone Solar.R Wind Temp Day date_col avg_temp_over_last_week
#> <int> <int> <int> <dbl> <int> <int> <date> <dbl>
#> 1 6 NA 286 8.6 78 1 1973-06-01 NaN
#> 2 6 NA 242 16.1 67 3 1973-06-03 78
#> 3 6 NA 186 9.2 84 4 1973-06-04 72.5
#> 4 6 NA 264 14.3 79 6 1973-06-06 76.3
#> 5 6 29 127 9.7 82 7 1973-06-07 77
#> 6 6 NA 273 6.9 87 8 1973-06-08 78
#> 7 6 NA 259 10.9 93 11 1973-06-11 83
#> 8 6 NA 250 9.2 92 12 1973-06-12 85.2
#> 9 6 23 148 8 82 13 1973-06-13 86.6
#> 10 6 NA 332 13.8 80 14 1973-06-14 87.2
#> # ℹ 112 more rowsFeatures
-
mutatesupports-
.by(group by), -
.order_by(order by), -
.frame(endpoints of window frame), -
.index(identify index column like date column, in df version only), -
.complete(whether to compute over incomplete window, in df version only).
-
-
mutateautomatically uses a future backend (viafurrr, in df version only).
Motivation
This implementation is inspired by Apache Spark’s windowSpec class with rangeBetween and rowsBetween.
Ecosystem
dbplyrimplements this viadbplyr::win_overenablingsparklyrusers to write window computations. Also see,dbplyr::window_order/dbplyr::window_frame.tidier’smutatewraps this functionality via uniform syntax across dataframes and remote tbls.tidypysparkpython package implementsmutatestyle window computation API for pyspark.
Acknowledgements
tidier package is deeply indebted to three amazing packages and people behind it.
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
Grammar of Data Manipulation_. R package version 1.1.0,
<https://CRAN.R-project.org/package=dplyr>.Vaughan D (2021). _slider: Sliding Window Functions_. R package
version 0.2.2, <https://CRAN.R-project.org/package=slider>.Wickham H, Girlich M, Ruiz E (2023). _dbplyr: A 'dplyr' Back End
for Databases_. R package version 2.3.2,
<https://CRAN.R-project.org/package=dbplyr>.