CAS : cache and invalidation with `tar_repository_cas_local()` [help] #1487

katossky · 2025-05-13T17:24:39Z


Help

I understand and agree to https://books.ropensci.org/targets/help.html.

Description

I find myself in the case described in #1232: a versioned cache shared between multiple users. In our case, the pipeline (simulating the French retirement system) has variants (economic, demographic and legislative hypotheses as well as modelling choices, combining into thousands of theoretically possible pipelines) and we frequently switch between a small subset of them as we work on a given project. Some early steps of the pipeline do not change much yet require hours of computation (simulating millions of careers under said hypotheses). Typically, a late question from a client would require me to add a new variant to one of my colleagues' work, a variant that shares 90% of the computation with existing ones. I don't think that branching over variants works well in our case, as it would result in an absurd number of branches, most of which would never be evaluated. Using a custom repository in tar_target seemed like the perfect caching solution, but my experiments have not worked as expected, prompting me to wonder whether I have misunderstood the intended behavior of a CAS system.

Below is a toy example of a parametrizable pipeline (where the parameter is n) in which storage happens in the _cache repository. Only the diamonds target ever gets skipped, even though by run 3 all targets already exist in _cache. I would have expected the other targets to be skipped as well.

Given the high quality of targets, I am certain this behavior is intentional. That said, I am not sure I understand the benefit of addressable storage without caching. To me, working in a team, the idea would indeed be that anytime 2 people run 2 pipelines with some shared steps, they benefit from each other's already-computed targets. (I for instance expected that the hash used for addressable storage was the invalidation hash, computed from the upstream targets, because for a proper content hash we would need to actually compute the target!) The ambiguity also seems to stem from the very terminology used in issues and discussions, especially around the often-used word “cache,” which to me typically suggests that unnecessary computations are skipped. I am not sure that a reader of the help page for tar_repository_cas_local would understand that CAS is not a cache.
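To make the distinction in the paragraph above concrete, here is a minimal sketch in base R of the two kinds of keys. It is not how targets is implemented: `hash_object()`, `cas_key()`, `input_key()` and `run_cached()` are hypothetical helpers invented for this illustration. A content-addressed key is a hash of the output, so it only exists once the target has run; an input-addressed key is a hash of the command and its upstream values, so it is known beforehand and can be used to skip work.

```r
# Hypothetical helper: hash any R object with base R only
# (serialize to an uncompressed temp file, then take its MD5).
hash_object <- function(x) {
  path <- tempfile()
  on.exit(unlink(path))
  saveRDS(x, path, compress = FALSE)
  unname(tools::md5sum(path))
}

# Content-addressed key: derived from the OUTPUT, so the command must run first.
cas_key <- function(value) hash_object(value)

# Input-addressed key: derived from the command and the hashes of the
# upstream values, so it is available BEFORE running the command.
input_key <- function(command, upstream) {
  hash_object(list(
    command  = deparse(command),
    upstream = lapply(upstream, hash_object)
  ))
}

# A toy memoizing runner keyed on inputs (assumption: an in-memory cache).
cache <- new.env()
run_cached <- function(command, upstream) {
  key <- input_key(command, upstream)
  if (!is.null(cache[[key]])) {
    return(list(value = cache[[key]], hit = TRUE))
  }
  value <- eval(command, upstream)
  cache[[key]] <- value
  list(value = value, hit = FALSE)
}

command <- quote(rep(seq_len(3), times = n))
r1 <- run_cached(command, list(n = 100))  # first variant: computed
r2 <- run_cached(command, list(n = 200))  # second variant: computed
r3 <- run_cached(command, list(n = 100))  # back to first variant: cache hit
c(r1$hit, r2$hit, r3$hit)
```

With input-addressed keys, the third call is served from the cache. With content-addressed keys alone, `cas_key(r1$value)` tells you where the result is stored but cannot be computed without running the command again.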

I understand that one can version the metadata file to obtain proper invalidation, but in our case that feels heavy: we would need multiple git branches for each variant and merge updates from master frequently as the rest of the codebase evolves. Moreover, our use case doesn't fundamentally require Git — users could in principle simply install our package and explore different variants without caring about version control, while still benefiting greatly from effective caching.

Could someone clarify whether I am missing a key feature (like a clever way of branching, or another type of CAS), whether the kind of caching I have in mind is within the scope of targets (in case it does not already exist), and/or whether there are workarounds? If a viable approach exists, my team is seriously considering moving to targets, and I would then be happy to invest time into contributing.

library(targets)

# ---- toy example of a parametrizable pipeline -------

# please pardon the ugliness of the eval(substitute()) construct
makeScript <- function(times = 100, repo = "local"){
  eval(substitute(
    tar_script(
      code = {
        list(
          tar_target(
            n, times,
            repository = repo
          ),
          tar_target(
            diamonds, ggplot2::diamonds,
            repository = repo
          ),
          tar_target(
            bigDiamonds, {
              Sys.sleep(5)
              ggplot2::diamonds |>
                replicate(n = n, simplify = FALSE) |>
                dplyr::bind_rows()
            },
            repository = repo
          ),
          tar_target(
            bigDiamondMean, {
              Sys.sleep(5)
              bigDiamonds |>
                dplyr::summarise(price = mean(price))
            },
            repository = repo
          )
        )
      },
      ask = FALSE
    ),
    list(times = times, repo = repo)
  ))
}

# make sure nothing is there
unlink("_cache", recursive = TRUE)
unlink("_targets", recursive = TRUE)

# ----- first variant ----- 
makeScript(times = 100, repo = tar_repository_cas_local(path = "_cache"))
tar_make()
# ▶ dispatched target n
# ● completed target n [0 seconds, 51 bytes]
# ▶ dispatched target diamonds
# ● completed target diamonds [0.218 seconds, 513.653 kilobytes]
# ▶ dispatched target bigDiamonds
# ● completed target bigDiamonds [5.594 seconds, 50.413 megabytes]
# ▶ dispatched target bigDiamondMean
# ● completed target bigDiamondMean [5.125 seconds, 143 bytes]
# ▶ ended pipeline [23.156 seconds]

# ----- second variant ----- 
makeScript(times = 200, repo = tar_repository_cas_local(path = "_cache"))
tar_make()
# ▶ dispatched target n
# ● completed target n [0 seconds, 51 bytes]
# ✔ skipping targets (1 so far)...
# ▶ dispatched target bigDiamonds
# ● completed target bigDiamonds [6.079 seconds, 100.809 megabytes]
# ▶ dispatched target bigDiamondMean
# ● completed target bigDiamondMean [5.062 seconds, 143 bytes]
# ▶ ended pipeline [35.094 seconds]

# ----- back to first variant -----
makeScript(times = 100, repo = tar_repository_cas_local(path = "_cache"))
tar_make()
# ▶ dispatched target n
# ● completed target n [0 seconds, 51 bytes]
# ✔ skipping targets (1 so far)...
# ▶ dispatched target bigDiamonds
# ● completed target bigDiamonds [5.828 seconds, 50.413 megabytes]
# ▶ dispatched target bigDiamondMean
# ● completed target bigDiamondMean [5.125 seconds, 143 bytes]
# ▶ ended pipeline [22.765 seconds]

wlandau · 2025-05-14T15:33:28Z

Maintainer

The CAS system in `targets` has 2 goals:

1. Avoid recomputation when it is feasible to revert the metadata to a previous state (the use case that Noam described in [general] New cloud hashing approach and collaborative workflows #1232 (comment)).
2. Allow users to insert custom upload/download methods to store data, as an alternative to the native AWS/GCP integration (e.g. for Azure: Azure storage #1323 (comment)).

Unfortunately, (1) requires a way to share metadata across collaborators. You're right to point out that this is a limitation and a source of friction for collaborative workflows. Although later/advanced features of targets do support collaboration, collaborative workflows have never been the primary use case of the package.

Git does not necessarily have to be the mechanism for sharing metadata; it just happens to be a rigorous one. On the other hand, it can be hard to think of an alternative. targets does have a way to switch among different projects, but in your case, you might risk overwriting someone else's project.

> Below is a toy example of a parametrizable pipeline (where parameter is n) and where storage happens in the _cache repository. Only the diamonds target ever get skipped even though by run 3 all targets already exist in _cache. I would have expected the other targets to be skipped as well.

This is actually expected behavior. If you run:

makeScript(times = 100, repo = tar_repository_cas_local(path = "_cache")) # first version
tar_make()
makeScript(times = 200, repo = tar_repository_cas_local(path = "_cache")) # second version
tar_make()
makeScript(times = 100, repo = tar_repository_cas_local(path = "_cache")) # switch back to first version

tar_sitrep() # Show why targets are outdated / up to date

you will see:

# A tibble: 4 × 11                           
  name           meta  always never command depend format repository iteration file  seed 
  <chr>          <lgl> <lgl>  <lgl> <lgl>   <lgl>  <lgl>  <lgl>      <lgl>     <lgl> <lgl>
1 bigDiamonds    FALSE FALSE  FALSE FALSE   FALSE  FALSE  FALSE      FALSE     FALSE FALSE
2 n              FALSE FALSE  FALSE TRUE    FALSE  FALSE  FALSE      FALSE     FALSE FALSE
3 diamonds       FALSE FALSE  FALSE FALSE   FALSE  FALSE  FALSE      FALSE     FALSE FALSE
4 bigDiamondMean FALSE FALSE  FALSE FALSE   FALSE  FALSE  FALSE      FALSE     FALSE FALSE

In the output of tar_sitrep(), the command cue of target n is TRUE, meaning the metadata thinks it detected a changed command. The next tar_make() will rerun target n, then detect a change in the hash of the output of n, then run bigDiamonds because the metadata is out of sync with the n = 100 data hash, and so on. Reverting to the old metadata would have caused n to appear up to date.
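The rule described above can be sketched with a toy metadata table in base R. This is only an illustration under simplifying assumptions, not the package's implementation: `record()` and `is_up_to_date()` are hypothetical helpers, the hash strings are placeholders, and only the `name`, `command` and `depend` columns of the metadata are modelled. The point is that the metadata keeps one row per target (the latest run), and the up-to-date check compares against that row, regardless of what data still exists in the CAS.

```r
# Hypothetical: store the latest run of a target, overwriting any earlier row.
record <- function(meta, name, command, depend) {
  meta <- meta[meta$name != name, , drop = FALSE]
  rbind(meta, data.frame(name = name, command = command, depend = depend))
}

# Hypothetical: a target is up to date only if the single row kept for it
# matches both the current command hash and the current dependency hash.
is_up_to_date <- function(meta, name, command, depend) {
  row <- meta[meta$name == name, , drop = FALSE]
  nrow(row) == 1L && row$command == command && row$depend == depend
}

meta <- data.frame(name = character(), command = character(), depend = character())

meta <- record(meta, "n", command = "cmd_100", depend = "none")  # run 1: times = 100
meta_after_run1 <- meta                                          # snapshot (e.g. a Git commit)
meta <- record(meta, "n", command = "cmd_200", depend = "none")  # run 2: times = 200

# Run 3 switches back to times = 100. The live row describes the 200 variant,
# so the command cue fires even though the n = 100 data is still in the CAS.
live     <- is_up_to_date(meta, "n", command = "cmd_100", depend = "none")
# Against the reverted metadata, the same target appears up to date.
reverted <- is_up_to_date(meta_after_run1, "n", command = "cmd_100", depend = "none")
c(live = live, reverted = reverted)
```

In this sketch `live` is `FALSE` and `reverted` is `TRUE`, which mirrors why reverting to the old metadata would have caused n to appear up to date.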

2 replies

katossky May 14, 2025
Author

Thank you very much for your time and for the explanation. I do get that my use case is at the limit of what targets was originally intended for. I also understood the invalidation mechanism, but tar_sitrep() does make it clearer. Thanks.

I have to think a little more about this metadata syncing thing. After what you say, I am not even sure whether syncing metadata covers our use case as I frame it. Do I understand correctly that there is no way, in the current setup, of having the following code (or its static equivalent with tar_map) ever skip already-computed targets? (Both variants should exist in the CAS, but as they were computed separately, there is no metadata object that lists them both, and thus they can't both be "valid" at the same time.)

list(
  tar_target(
    n, c(100,200),
    repository = repo
  ),
  tar_target(
    diamonds, ggplot2::diamonds,
    repository = repo
  ),
  tar_target(
    bigDiamonds, {
      Sys.sleep(5)
      ggplot2::diamonds |>
        replicate(n = n, simplify = FALSE) |>
        dplyr::bind_rows()
    },
    repository = repo,
    pattern = map(n)
  ),
  tar_target(
    bigDiamondMean, {
      Sys.sleep(5)
      bigDiamonds |>
        dplyr::summarise(price = mean(price))
    },
    repository = repo,
    pattern = map(bigDiamonds),
    iteration = "list"
  )
)

The use case would be two parts of a pipeline that can both be executed somewhat separately yet combined in creative ways, and where we would love to spare the computation arising from unforeseen combinations. It doesn't look like targets projects would help here, would they?

Do you think it would be feasible for me to create some custom write function that would use the hash that appears in tar_meta(fields = "depend"), together with a syncing function at the top of my script or a custom read function that would retrieve objects based on this hash?

PS: In hindsight, it seems to me that "shar[ing] metadata across collaborators" is not really the issue here. My toy example does not have any sort of collaboration and yet I fail to spare computation.

wlandau May 14, 2025
Maintainer

> Do I understand correctly that there is no way, in the current setup, of having the following code (or its static equivalent with tar_map) to ever skip already-computed targets ?

Correct, unless you can somehow retrieve the original metadata from the target in question.

> Do you think it would be feasible for me to create some custom write function that would use the hash that appear tar_meta(fields = "depend") together with a syncing function at the top of my script or a custom read function that would retrieve objects based on this hash ?

I'm not sure that would be feasible. Perhaps a constructive approach could be to write similar code to read and combine the various metadata files?

> PS: In hindsight, it seems to me that "shar[ing] metadata across collaborators" is not really the issue here. My toy example does not have any sort of collaboration and yet I fail to spare computation.

True, it's more like an individual contributor who has to revert the metadata file to a previous state. But either way, the limitation of targets is the same.
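The suggestion above to read and combine the various metadata files could look roughly like the following base R sketch. It works on toy data frames rather than real metadata files, and `combine_meta()` and `find_row()` are hypothetical helpers; the hash strings are placeholders. Real metadata has many more columns, and writing a recovered row back into a live data store is not covered here.

```r
# Hypothetical: stack several metadata snapshots and keep one row per
# (name, command hash, dependency hash) combination.
combine_meta <- function(...) {
  all_rows <- do.call(rbind, list(...))
  key <- paste(all_rows$name, all_rows$command, all_rows$depend, sep = "|")
  all_rows[!duplicated(key), , drop = FALSE]
}

# Hypothetical: look up the archived row matching a target's current
# command and dependency hashes, if that combination was ever computed.
find_row <- function(history, name, command, depend) {
  history[
    history$name == name &
      history$command == command &
      history$depend == depend,
    ,
    drop = FALSE
  ]
}

# Toy snapshots from two separate runs (n = 100, then n = 200).
meta_run1 <- data.frame(
  name    = c("n", "bigDiamonds"),
  command = c("cmd_100", "cmd_big"),
  depend  = c("none", "dep_100"),
  data    = c("hash_a", "hash_b")
)
meta_run2 <- data.frame(
  name    = c("n", "bigDiamonds"),
  command = c("cmd_200", "cmd_big"),
  depend  = c("none", "dep_200"),
  data    = c("hash_c", "hash_d")
)

# Passing run 1 twice shows that duplicates are dropped.
history <- combine_meta(meta_run1, meta_run2, meta_run1)

# Switching back to n = 100: the matching row points at the stored data hash.
hit <- find_row(history, "bigDiamonds", command = "cmd_big", depend = "dep_100")
hit$data
```

Here `history` holds four distinct rows, and the lookup returns the row whose `data` hash identifies the object already stored for the n = 100 variant.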

katossky · 2025-05-15T17:47:02Z

Author

I got a rough sketch working. Remaining problems:

- rows get duplicated in the historical meta file
- some "restored" objects are just objects that haven't changed
- children of targets that yield known objects in the middle of the pipeline are not covered

library(targets)

tar_meta_backup <- function(
  from = "_targets/meta/meta",
  to   = "_cache/meta/meta"
){
  if(!file.exists(from)) stop("`from` file does not exist: ", from, call. = FALSE)
  
  new  <- data.table::fread(from, sep = "|", header = TRUE, na.strings = "")
  new$user <- Sys.info()[["user"]] %||% Sys.getenv("USER")
  new$time <- targets:::file_time_posixct(new$time)

  if (file.exists(to)) {
    new$flag    <- TRUE
    hist <- data.table::fread(to, sep = "|", header = TRUE, na.strings = "")
    hist$time <- targets:::file_time_posixct(hist$time)
    if(! "counter" %in% colnames(hist)) hist$counter <- 1L
    if(! "user" %in% colnames(hist)) hist$user <- NA_character_
    merged <- merge(
      hist, new,
      by = setdiff(colnames(new), c('time', 'seconds', 'repository', 'flag', 'user')),
      all     = TRUE,
      suffixes = c("", "New")
    )
    # update counter
    merged[!is.na(counter) & !is.na(flag), let(counter = counter+1L)]
    merged[is.na(counter) & !is.na(flag), let(counter = 1L)]
    if(nrow(merged[is.na(counter) & is.na(flag)]) > 0) stop("Can't have both flag and counter missing")
    # keep latest time, user and repository
    merged[, let(
      time    = pmax(na.rm = TRUE, time, timeNew),
      user = data.table::fcase(
        # normally if all cues match nothing should be rerun
        # the only case is when two targets are run in parallel
        # by different users
        is.na(user) | (time != timeNew), userNew,
        default = user
      ),
      repository = data.table::fcase(
        # normally if all cues match nothing should be rerun
        # the only case is when two targets are run in parallel
        # by different users
        is.na(repository) | (time != timeNew), repositoryNew,
        default = repository
      ),
      timeNew = NULL,
      userNew = NULL,
      repositoryNew = NULL
    )]
    # average execution time
    merged[, let(
      seconds = data.table::fcase(
        is.na(seconds), secondsNew,
        !is.na(secondsNew), seconds*(counter-!is.na(flag))/counter + secondsNew/counter,
        default = seconds
      ),
      secondsNew = NULL
    )]
    # remove flag
    merged[, flag := NULL]

  } else {
    new$counter <- 1L
    merged <- new
  }

  if(!dir.exists(dirname(to))) dir.create(dirname(to), recursive = TRUE)
  
  file_time_reference <- as.POSIXct("1970-01-01 00:00:00", tz = "UTC")
  
  data.table::fwrite(
    merged[order(-time)][, let(
      time = sprintf(
        "t%ss",
        difftime(time, file_time_reference, units = "days") |>
          as.numeric() |> as.character()
      )
    )], # keep newest lines on top purely for convenience
    file    = to,
    sep     = "|",
    na      = "",
    quote   = FALSE
  )
}

tar_meta_restore <- function(
  repository = "_cache",
  store = targets::tar_config_get("store"),
  script = targets::tar_config_get("script")
){
  meta         <- targets:::meta_init(path_store = store)
  metaOld      <- targets:::meta_init(path_store = store)$database$read_data()
  # metaOld$time <- targets:::file_time_posixct(metaOld$time)
  histMeta     <- targets:::meta_init(path_store = repository)$database$read_data()
  file_time_reference <- as.POSIXct("1970-01-01 00:00:00", tz = "UTC")
  histMeta$time <- targets:::file_time_posixct(histMeta$time)
  pipeline     <- targets:::pipeline_init(targets:::tar_script_targets(script = script))
  target_names <- targets:::topo_sort_igraph(targets:::pipeline_produce_igraph(pipeline)) # in topological order
  restored <- 0L
  for(target_name in target_names){
    target <- targets:::pipeline_get_target(pipeline, target_name)
    targetCommand <- target$command$hash
    deps       <- target$deps
    targetDeps <- intersect(deps, target_names)
    hashes <-  sapply(targetDeps, function(t) meta$lookup[[t]]$data)
    targetDepend <-  targets:::hash_object(paste(c(names(hashes), hashes), collapse = ""))

    # do we find those objects in the current meta or in the cache?
    hitOld <-  metaOld[metaOld$command == targetCommand & metaOld$depend == targetDepend,]
    if(nrow(hitOld) > 1) stop("Code only works for one match in current meta")
    hit    <- histMeta[histMeta$command == targetCommand & histMeta$depend == targetDepend,]
    
    if(nrow(hitOld)){

      # if in the current meta, do nothing
      meta$insert_row(as.data.frame(hitOld))

    } else if (nrow(hit)) {
      # if in the archived meta, take the most recent copy
      row <- hit[order(hit$time, decreasing = TRUE),][1,]

      # insert into live meta
      row$time <- sprintf("t%ss", row$time |>
        difftime(file_time_reference, units = "days") |>
        as.numeric() |> as.character()
      )
      meta$insert_row(as.data.frame(row))
      restored <- restored + 1L
      message(sprintf("✔ restored %s", target_name))
      next
    } else {
      message(sprintf("✗ %s not known", target_name))
    }
  }
  # flush meta to disk
 
  meta$database$flush_rows()
  message(sprintf("%s targets restored", restored))
  meta$database$close()
}

makeScript <- function(
  price_value = "low", # high
  colour_value = "E", # D
  repo = tar_repository_cas_local(path = "_cache")
){
  eval(substitute(env = list(price_value = price_value, colour_value = colour_value, repo = repo), tar_script(ask = FALSE, code = list(
    tar_target(switch_price,   price_value, repository = repo),
    tar_target(switch_colour, colour_value, repository = repo),
    tar_target(diamonds, ggplot2::diamonds, repository = repo),
    tar_target(
      price_subset, repository = repo,
      {
        if (switch_price == "low") {
          dplyr::filter(diamonds, price < 2500)
        } else {
          dplyr::filter(diamonds, price >= 2500)
        }
      }
    ),
    tar_target(
      colour_subset, repository = repo,
      dplyr::filter(diamonds, color == switch_colour)
    ),
    tar_target(
      combined_summary, repository = repo,
      {
        tibble::tibble(
          price_switch      = switch_price,
          n_price_subset    = nrow(price_subset),
          colour_switch     = switch_colour,
          n_colour_subset   = nrow(colour_subset)
        )
      }
    )
  ))))
}

# Start from an empty environment
unlink("_cache", recursive = TRUE)
unlink("_targets", recursive = TRUE)

makeScript("low", "E") # first run
tar_make()
tar_meta_backup()

makeScript("high", "D") # second run
tar_meta_restore()
tar_sitrep()
tar_make()
tar_meta_backup()

makeScript("low", "D") # third run, should reuse parts of the previous two
tar_meta_restore()
tar_sitrep()
tar_make()
tar_meta_backup()

0 replies
