Skip to contents

Provides supercharged version of mutate with group_by, order_by and aggregation over arbitrary window frame around a row for dataframes and lazy (remote) tbls of class tbl_lazy.

Usage

mutate_(
  x,
  ...,
  .by,
  .order_by,
  .frame,
  .index,
  .desc = FALSE,
  .complete = FALSE
)

Arguments

x

(data.frame or tbl_lazy)

...

expressions to be passed to mutate

.by

(character vector, optional: Yes) Columns to group by

.order_by

(string, optional: Yes) Columns to order by

.frame

(vector, optional: Yes) Vector of length 2 indicating the number of rows to consider before and after the current row. When argument .index is provided (typically a column of type date or datetime), before and after can be interval objects. See examples. When input is tbl_lazy, only number of rows as vector of length 2 is supported.

.index

(string, optional: Yes, default: NULL) index column. This is supported when input is a dataframe only.

.desc

(flag, default: FALSE) Whether to order in descending order

.complete

(flag, default: FALSE) This will be passed to slider::slide / slider::slide_vec. Should the function be evaluated on complete windows only? If FALSE or NULL, the default, then partial computations will be allowed. This is supported when input is a dataframe only.

Value

data.frame or tbl_lazy

Details

A window function returns a value for every input row of a dataframe or lazy_tbl based on a group of rows (frame) in the neighborhood of the input row. This function implements computation over groups (partition_by in SQL) in a predefined order (order_by in SQL) across a neighborhood of rows (frame) defined by a (up, down) where

  • up/down are number of rows before and after the corresponding row

  • up/down are interval objects (ex: c(days(2), days(1))). Interval objects are currently supported for dataframe only. (not tbl_lazy)

This implementation is inspired by spark's window API.

Implementation Details:

For dataframe input:

  • Iteration per row over the window is implemented using the versatile slider.

  • Application of a window aggregation can be optionally run in parallel over multiple groups (see argument .by) by setting a future parallel backend. This is implemented using furrr package.

  • function subsumes regular usecases of mutate

For tbl_lazy input:

See also

mutate

Examples

library("magrittr")
# example 1 (simple case with dataframe)
# Using iris dataset,
# compute cumulative mean of column `Sepal.Length`
# ordered by `Petal.Width` and `Sepal.Width` columns
# grouped by `Petal.Length` column

iris %>%
  tidier::mutate_(sl_mean = mean(Sepal.Length),
                  .order_by = c("Petal.Width", "Sepal.Width"),
                  .by = "Petal.Length",
                  .frame = c(Inf, 0),
                  ) %>%
  dplyr::slice_min(n = 3, Petal.Width, by = Species)
#> # A tibble: 15 × 6
#>    Petal.Length Sepal.Length Sepal.Width Petal.Width Species    sl_mean
#>           <dbl>        <dbl>       <dbl>       <dbl> <fct>        <dbl>
#>  1          1.4          4.8         3           0.1 setosa        4.8 
#>  2          1.4          4.9         3.6         0.1 setosa        4.85
#>  3          1.1          4.3         3           0.1 setosa        4.3 
#>  4          1.5          4.9         3.1         0.1 setosa        4.9 
#>  5          1.5          5.2         4.1         0.1 setosa        5.05
#>  6          3.5          5           2           1   versicolor    5   
#>  7          3.5          5.7         2.6         1   versicolor    5.35
#>  8          3.3          4.9         2.4         1   versicolor    4.95
#>  9          3.3          5           2.3         1   versicolor    5   
#> 10          4.1          5.8         2.7         1   versicolor    5.8 
#> 11          4            6           2.2         1   versicolor    6   
#> 12          3.7          5.5         2.4         1   versicolor    5.5 
#> 13          5.6          6.1         2.6         1.4 virginica     6.1 
#> 14          5.1          6.3         2.8         1.5 virginica     6.3 
#> 15          5            6           2.2         1.5 virginica     6   

# example 2 (detailed case with dataframe)
# Using a sample airquality dataset,
# compute mean temp over last seven days in the same month for every row

set.seed(101)
airquality %>%
  # create date column
  dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) %>%
  # create gaps by removing some days
  dplyr::slice_sample(prop = 0.8) %>%
  dplyr::arrange(date_col) %>%
  # compute mean temperature over last seven days in the same month
  tidier::mutate_(avg_temp_over_last_week = mean(Temp, na.rm = TRUE),
                  .order_by = "Day",
                  .by = "Month",
                  .frame = c(lubridate::days(7), # 7 days before current row
                            lubridate::days(-1) # do not include current row
                            ),
                  .index = "date_col"
                  )
#> # A tibble: 122 × 8
#>    Month Ozone Solar.R  Wind  Temp   Day date_col   avg_temp_over_last_week
#>    <int> <int>   <int> <dbl> <int> <int> <date>                       <dbl>
#>  1     6    NA     264  14.3    79     6 1973-06-06                    76.3
#>  2     6    NA     250   9.2    92    12 1973-06-12                    85.2
#>  3     6    37     284  20.7    72    17 1973-06-17                    85.2
#>  4     6    NA     135   8      75    25 1973-06-25                    75.5
#>  5     6    NA      31  14.9    77    29 1973-06-29                    76.4
#>  6     7    49     248   9.2    85     2 1973-07-02                    84  
#>  7     7    77     276   5.1    88     7 1973-07-07                    83.4
#>  8     7    27     175  14.9    81    13 1973-07-13                    83  
#>  9     7    48     260   6.9    81    16 1973-07-16                    82.7
#> 10     7    52      82  12      86    27 1973-07-27                    82.4
#> # ℹ 112 more rows
# example 3
airquality %>%
   # create date column as character
   dplyr::mutate(date_col =
                   as.character(lubridate::make_date(1973, Month, Day))
                 ) %>%
   tibble::as_tibble() %>%
   # as `tbl_lazy`
   dbplyr::memdb_frame() %>%
   mutate_(avg_temp = mean(Temp),
           .by = "Month",
           .order_by = "date_col",
           .frame = c(3, 3)
           ) %>%
   dplyr::collect() %>%
   dplyr::select(Ozone, Solar.R, Wind, Temp, Month, Day, date_col, avg_temp)
#> # A tibble: 153 × 8
#>    Ozone Solar.R  Wind  Temp Month   Day date_col   avg_temp
#>    <int>   <int> <dbl> <int> <int> <int> <chr>         <dbl>
#>  1    41     190   7.4    67     5     1 1973-05-01     68.8
#>  2    36     118   8      72     5     2 1973-05-02     66.2
#>  3    12     149  12.6    74     5     3 1973-05-03     66.2
#>  4    18     313  11.5    62     5     4 1973-05-04     66  
#>  5    NA      NA  14.3    56     5     5 1973-05-05     64.9
#>  6    28      NA  14.9    66     5     6 1973-05-06     63.3
#>  7    23     299   8.6    65     5     7 1973-05-07     62.6
#>  8    19      99  13.8    59     5     8 1973-05-08     64.3
#>  9     8      19  20.1    61     5     9 1973-05-09     66.1
#> 10    NA     194   8.6    69     5    10 1973-05-10     66.1
#> # ℹ 143 more rows