tidier
package provides ‘Apache Spark’ style window aggregation for R dataframes and remote dbplyr
tbls via ‘mutate’ in ‘dplyr’ flavour.
Example
Create a new column with average temp over last seven days in the same month.
set.seed(101)
airquality |>
# create date column
dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) |>
# create gaps by removing some days
dplyr::slice_sample(prop = 0.8) |>
# compute mean temperature over last seven days in the same month
tidier::mutate(avg_temp_over_last_week = mean(Temp, na.rm = TRUE),
.order_by = Day,
.by = Month,
.frame = c(lubridate::days(7), # 7 days before current row
lubridate::days(-1) # do not include current row
),
.index = date_col
)
#> # A tibble: 122 × 8
#> Month Ozone Solar.R Wind Temp Day date_col avg_temp_over_last_week
#> <int> <int> <int> <dbl> <int> <int> <date> <dbl>
#> 1 6 NA 286 8.6 78 1 1973-06-01 NaN
#> 2 6 NA 242 16.1 67 3 1973-06-03 78
#> 3 6 NA 186 9.2 84 4 1973-06-04 72.5
#> 4 6 NA 264 14.3 79 6 1973-06-06 76.3
#> 5 6 29 127 9.7 82 7 1973-06-07 77
#> 6 6 NA 273 6.9 87 8 1973-06-08 78
#> 7 6 NA 259 10.9 93 11 1973-06-11 83
#> 8 6 NA 250 9.2 92 12 1973-06-12 85.2
#> 9 6 23 148 8 82 13 1973-06-13 86.6
#> 10 6 NA 332 13.8 80 14 1973-06-14 87.2
#> # ℹ 112 more rows
Features
-
mutate
supports-
.by
(group by), -
.order_by
(order by), -
.frame
(endpoints of window frame), -
.index
(identify index column like date column, in df version only), -
.complete
(whether to compute over incomplete window, in df version only).
-
-
mutate
automatically uses a future backend (viafurrr
, in df version only).
Motivation
This implementation is inspired by Apache Spark’s windowSpec
class with rangeBetween
and rowsBetween
.
Ecosystem
dbplyr
implements this viadbplyr::win_over
enablingsparklyr
users to write window computations. Also see,dbplyr::window_order
/dbplyr::window_frame
.tidier
’smutate
wraps this functionality via uniform syntax across dataframes and remote tbls.tidypyspark
python package implementsmutate
style window computation API for pyspark.
Acknowledgements
tidier
package is deeply indebted to three amazing packages and people behind it.
D (2023). _dplyr: A
Wickham H, François R, Henry L, Müller K, Vaughan 1.1.0,
Grammar of Data Manipulation_. R package version <https://CRAN.R-project.org/package=dplyr>.
D (2021). _slider: Sliding Window Functions_. R package
Vaughan 0.2.2, <https://CRAN.R-project.org/package=slider>. version
E (2023). _dbplyr: A 'dplyr' Back End
Wickham H, Girlich M, Ruiz for Databases_. R package version 2.3.2,
<https://CRAN.R-project.org/package=dbplyr>.