Skip to contents

Provides supercharged version of mutate with group_by, order_by and aggregation over arbitrary window frame around a row for dataframes and lazy (remote) tbls of class tbl_lazy.

Usage

mutate(x, ..., .by, .order_by, .frame, .index, .complete = FALSE)

Arguments

x

(data.frame or tbl_lazy)

...

expressions to be passed to mutate

.by

(expression, optional: Yes) Columns to group by

.order_by

(expression, optional: Yes) Columns to order by

.frame

(vector, optional: Yes) Vector of length 2 indicating the number of rows to consider before and after the current row. When argument .index is provided (typically a column of type date or datetime), before and after can be interval objects. See examples. When input is tbl_lazy, only number of rows as vector of length 2 is supported.

.index

(expression, optional: Yes, default: NULL) index column. This is supported when input is a dataframe only.

.complete

(flag, default: FALSE) This will be passed to slider::slide / slider::slide_vec. Should the function be evaluated on complete windows only? If FALSE or NULL, the default, then partial computations will be allowed. This is supported when input is a dataframe only.

Value

data.frame or tbl_lazy

Details

A window function returns a value for every input row of a dataframe or lazy_tbl based on a group of rows (frame) in the neighborhood of the input row. This function implements computation over groups (partition_by in SQL) in a predefined order (order_by in SQL) across a neighborhood of rows (frame) defined by a (up, down) where

  • up/down are number of rows before and after the corresponding row

  • up/down are interval objects (ex: c(days(2), days(1))). Interval objects are currently supported for dataframe only. (not tbl_lazy)

This implementation is inspired by spark's window API.

Implementation Details:

For dataframe input:

  • Iteration per row over the window is implemented using the versatile slider.

  • Application of a window aggregation can be optionally run in parallel over multiple groups (see argument .by) by setting a future parallel backend. This is implemented using furrr package.

  • function subsumes regular usecases of mutate

For tbl_lazy input:

See also

mutate_

Examples

library("magrittr")
# example 1 (simple case with dataframe)
# Using iris dataset,
# compute cumulative mean of column `Sepal.Length`
# ordered by `Petal.Width` and `Sepal.Width` columns
# grouped by `Petal.Length` column

iris %>%
  mutate(sl_mean = mean(Sepal.Length),
         .order_by = c(Petal.Width, Sepal.Width),
         .by = Petal.Length,
         .frame = c(Inf, 0),
         ) %>%
  dplyr::slice_min(n = 3, Petal.Width, by = Species)
#> 
#> Attaching package: ‘purrr’
#> The following object is masked from ‘package:magrittr’:
#> 
#>     set_names
#> # A tibble: 15 × 6
#>    Petal.Length Sepal.Length Sepal.Width Petal.Width Species    sl_mean
#>           <dbl>        <dbl>       <dbl>       <dbl> <fct>        <dbl>
#>  1          1.4          4.8         3           0.1 setosa        4.8 
#>  2          1.4          4.9         3.6         0.1 setosa        4.85
#>  3          1.1          4.3         3           0.1 setosa        4.3 
#>  4          1.5          4.9         3.1         0.1 setosa        4.9 
#>  5          1.5          5.2         4.1         0.1 setosa        5.05
#>  6          3.5          5           2           1   versicolor    5   
#>  7          3.5          5.7         2.6         1   versicolor    5.35
#>  8          4            6           2.2         1   versicolor    6   
#>  9          3.3          5           2.3         1   versicolor    5   
#> 10          3.3          4.9         2.4         1   versicolor    4.95
#> 11          3.7          5.5         2.4         1   versicolor    5.5 
#> 12          4.1          5.8         2.7         1   versicolor    5.8 
#> 13          5.6          6.1         2.6         1.4 virginica     6.1 
#> 14          5            6           2.2         1.5 virginica     6   
#> 15          5.1          6.3         2.8         1.5 virginica     6.3 

# example 2 (detailed case with dataframe)
# Using a sample airquality dataset,
# compute mean temp over last seven days in the same month for every row

set.seed(101)
airquality %>%
  # create date column
  dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) %>%
  # create gaps by removing some days
  dplyr::slice_sample(prop = 0.8) %>%
  dplyr::arrange(date_col) %>%
  # compute mean temperature over last seven days in the same month
  tidier::mutate(avg_temp_over_last_week = mean(Temp, na.rm = TRUE),
                 .order_by = Day,
                 .by = Month,
                 .frame = c(lubridate::days(7), # 7 days before current row
                            lubridate::days(-1) # do not include current row
                            ),
                 .index = date_col
                 )
#> # A tibble: 122 × 8
#>    Month Ozone Solar.R  Wind  Temp   Day date_col   avg_temp_over_last_week
#>    <int> <int>   <int> <dbl> <int> <int> <date>                       <dbl>
#>  1     6    NA     286   8.6    78     1 1973-06-01                   NaN  
#>  2     6    NA     242  16.1    67     3 1973-06-03                    78  
#>  3     6    NA     186   9.2    84     4 1973-06-04                    72.5
#>  4     6    NA     264  14.3    79     6 1973-06-06                    76.3
#>  5     6    29     127   9.7    82     7 1973-06-07                    77  
#>  6     6    NA     273   6.9    87     8 1973-06-08                    78  
#>  7     6    NA     259  10.9    93    11 1973-06-11                    83  
#>  8     6    NA     250   9.2    92    12 1973-06-12                    85.2
#>  9     6    23     148   8      82    13 1973-06-13                    86.6
#> 10     6    NA     332  13.8    80    14 1973-06-14                    87.2
#> # ℹ 112 more rows
# example 3
airquality %>%
   # create date column as character
   dplyr::mutate(date_col =
                   as.character(lubridate::make_date(1973, Month, Day))
                 ) %>%
   tibble::as_tibble() %>%
   # as `tbl_lazy`
   dbplyr::memdb_frame() %>%
   mutate(avg_temp = mean(Temp),
          .by = Month,
          .order_by = date_col,
          .frame = c(3, 3)
          ) %>%
   dplyr::collect() %>%
   dplyr::select(Ozone, Solar.R, Wind, Temp, Month, Day, date_col, avg_temp)
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # A tibble: 153 × 8
#>    Ozone Solar.R  Wind  Temp Month   Day date_col   avg_temp
#>    <int>   <int> <dbl> <int> <int> <int> <chr>         <dbl>
#>  1    41     190   7.4    67     5     1 1973-05-01     68.8
#>  2    36     118   8      72     5     2 1973-05-02     66.2
#>  3    12     149  12.6    74     5     3 1973-05-03     66.2
#>  4    18     313  11.5    62     5     4 1973-05-04     66  
#>  5    NA      NA  14.3    56     5     5 1973-05-05     64.9
#>  6    28      NA  14.9    66     5     6 1973-05-06     63.3
#>  7    23     299   8.6    65     5     7 1973-05-07     62.6
#>  8    19      99  13.8    59     5     8 1973-05-08     64.3
#>  9     8      19  20.1    61     5     9 1973-05-09     66.1
#> 10    NA     194   8.6    69     5    10 1973-05-10     66.1
#> # ℹ 143 more rows