Replies: 2 comments 2 replies
-
The CAS system in ``targets` has 2 goals:
Unfortunately, (1) requires a way to share metadata across collaborators. You're right to point out that this is a limitation and a source of friction for collaborative workflows. Although later/advanced features of Git does not necessarily have to be the mechanism for sharing metadata, it just happens to be a rigorous one. On the other hand, it can be hard to think of an alternative.
This is actually expected behavior. If you run: makeScript(times = 100, repo = tar_repository_cas_local(path = "_cache")) # first version
tar_make()
makeScript(times = 200, repo = tar_repository_cas_local(path = "_cache")) # second version
tar_make()
makeScript(times = 100, repo = tar_repository_cas_local(path = "_cache")) # switch back to first version
tar_sitrep() # Show why targets are outdated / up to date you will see: # A tibble: 4 × 11
name meta always never command depend format repository iteration file seed
<chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 bigDiamonds FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 n FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
3 diamonds FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4 bigDiamondMean FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE In the output of |
Beta Was this translation helpful? Give feedback.
-
I got a rough sketch working. Problems left :
|
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Help
Description
I find myself in the case described in #1232 of a versionned cache shared between mutliple users. In our case, the pipeline (simulating the French retirement system) has variants (economic, demographic and ligislative hypotheses as well as modelling choices, combining to thousands of theoretically possible pipelines) and we frequently switch between a small subset of them as we work on a given project. Some early steps of the pipeline do not change much yet require hours of computation (simulating millions of careers under said hypotheses). Typically, a late question from client would require me to add a new variant to one of my colleagues work, variant that shares 90% of the computation with existing ones. I don't think that branching over variants works well in our case as it would result in an absurd number of branches, of which most would never be evaluated. Using custom
repository
intar_target
seemed like the perfect caching solution, but my experiments have not worked as expected — prompting me to wonder whether I have misunderstood the intended behavior of a CAS system.Below is a toy example of a parametrizable pipeline (where parameter is
n
) and where storage happens in the_ cache
repository. Only thediamonds
target ever get skipped even though by run 3 all targets already exist in_cache
. I would have expected the other targets to be skipped as well.Given the high quality of
targets
, I am certain this behavior is intentional. That said, I am not sure I understand the benefit of adressable storage without caching. To me, working in a team, the idea would indeed be that anytime 2 people run 2 pipelines with some shared steps, they benefit from each-other's already-computed targets. (I for instance expected that the hash used for adressable storage was the invalidation hash, computed from the upstream targets, because for a proper hash we would need to actuall compute the target!) The ambiguity also seems to stem from the very terminology used in issues and dicussions, especially around the often-used word “cache,” which to me typically suggests that unnecessary computation are skipped. Not sure that the reader of help page fortar_repository_cas_local
would understand that CAS is not cache.I understand that one can version the metadata file to obtain proper invalidation, but in our case that feels heavy: we would need multiple git branches for each variant and merge updates from master frequently as the rest of the codebase evolves. Moreover, our use case doesn't fundamentally require Git — users could in principle simply install our package and explore different variants without caring about version control, while still benefiting greatly from effective caching.
Could someone clarify whether I am missing a key feature (like a clever way of branching, or an other type of CAS), whether the kind of caching I have in mind is within the scope of targets (in case it does not already exist), and/or whether there are workarounds? If a viable approach exists, my team is seriously considering moving to targets, and I would then be happy to invest time into contributing.
Beta Was this translation helpful? Give feedback.
All reactions