
Feat webdav nextcloud chunked v2 #6133

Draft
wants to merge 5 commits into master

Conversation

@devnoname120 devnoname120 commented Apr 28, 2022

Supersedes #4865.
Fixes #3666.

Follow-up PR with parallel chunks upload: #6179

Compared to the previous PR:

  • Rebase changes on top of master & resolve conflicts.
  • Fix upload failures when the upload folder already exists. This happens when rclone is closed abruptly.
  • Fix chunk merge retries. Nextcloud sometimes seems to need a bit of time before it can execute the MOVE operation without erroring out with a 404.
  • Rename the chunk_size config to nextcloud_chunk_size and set it to Nextcloud's default: 10 MiB (see the sketch after this list).
  • Set the chunked upload content type to application/x-www-form-urlencoded in order to follow the recommendations of Nextcloud.
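
For reference, a minimal sketch of how the renamed option could be declared in the webdav backend; the help text and exact default constant here are assumptions, not this PR's actual code:

    import "github.com/rclone/rclone/fs"

    // Hypothetical declaration of the renamed option (field values are assumptions).
    var nextcloudChunkSizeOption = fs.Option{
        Name:     "nextcloud_chunk_size",
        Help:     "Nextcloud upload chunk size (0 to disable chunked uploading).",
        Default:  fs.SizeSuffix(10 * 1024 * 1024), // matches Nextcloud's own default of 10 MiB
        Advanced: true,
    }

In the real backend this entry would sit inside the Options slice passed to fs.Register().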

Known limitations:

  • Chunk uploads can't be retried. See golang/go#52603. If it doesn't get merged soon enough I'll do a workaround in this PR.
    • Update: I implemented a workaround which uses readers.NewRepeatableLimitReaderBuffer() (see the sketch after this list). It allocates a buffer of the chunk size, so it may use a lot of memory if the chunk size is set to a large value in the remote config. Maybe we could obtain a seekable reader from the upper-level abstraction instead of creating our own wrapper.
  • Nextcloud keeps some odd locks on files, which sometimes results in the error 423 Locked. Need to figure out why this happens and whether there is a workaround.
  • If an upload is aborted, it needs to be restarted from the beginning. Instead we could hash the chunks (maybe do a first pass of size comparison) to check whether they were correctly uploaded.
    • ⚠️ We would need to make sure to delete previously-uploaded chunks that are not in the current list of chunks. This is important if the chunk size was modified between invocations to avoid corrupting the file on merge with irrelevant chunks.
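
A minimal sketch of the repeatable-reader workaround mentioned above, assuming a hypothetical uploadOne callback that performs a single chunk PUT and may be retried; this is not this PR's exact code:

    package webdav

    import (
        "io"

        "github.com/rclone/rclone/lib/readers"
    )

    // uploadChunks hands each chunk to uploadOne as a seekable reader, so a retried
    // PUT can Seek back to the start of the chunk and replay it from the buffer
    // instead of re-reading the underlying (non-seekable) input.
    func uploadChunks(in io.Reader, size, chunkSize int64, uploadOne func(chunk io.ReadSeeker, n int64) error) error {
        buf := make([]byte, chunkSize) // one chunk of memory, hence the note about large chunk sizes
        for offset := int64(0); offset < size; offset += chunkSize {
            n := chunkSize
            if size-offset < n {
                n = size - offset
            }
            chunk := readers.NewRepeatableLimitReaderBuffer(in, buf, n)
            if err := uploadOne(chunk, n); err != nil {
                return err
            }
        }
        return nil
    }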

TODO:

  • When merging a transfer, if we receive a 504 then start polling the MOVE operation regularly (with a timeout) until it's no longer LOCKED and the output file exists (a rough sketch of this polling idea follows the TODO list). We have to do this because if we return success too early, rclone checks the file, sees that it's not correct, and restarts the transfer.
  • Don't purge long MOVE operations, to work around server timeouts (504 Gateway Time-out then 423 Locked).
  • [In a follow-up PR because this is independent of these changes] Check the hash of the uploaded file to make sure it was uploaded properly?
    • rclone just checks the size and modified date by default, which is kinda weak. Checking the hash should be easy (IIRC Nextcloud just stores it, so it's free performance-wise to fetch), and since we upload the whole thing we can either get the hash from the remote (if it's free to get) or compute it on the fly when uploading.
  • Add tests
  • Fix crash #6133 (comment) (see fix: 5cf9b18)
  • [In a follow-up PR, see https://github.com/devnoname120/rclone/tree/feat-webdav-nextcloud-parallel-chunk-upload] Upload chunks in parallel to boost the speed of sending big files with small chunks.
    • Maybe see rclone/backend/b2/b2.go, lines 1214 to 1233 in 1d6d41f:

      // Delete Config.Transfers in parallel
      toBeDeleted := make(chan *api.File, f.ci.Transfers)
      var wg sync.WaitGroup
      wg.Add(f.ci.Transfers)
      for i := 0; i < f.ci.Transfers; i++ {
          go func() {
              defer wg.Done()
              for object := range toBeDeleted {
                  oi, err := f.newObjectWithInfo(ctx, object.Name, object)
                  if err != nil {
                      fs.Errorf(object.Name, "Can't create object %v", err)
                      continue
                  }
                  tr := accounting.Stats(ctx).NewCheckingTransfer(oi)
                  err = f.deleteByID(ctx, object.ID, object.Name)
                  checkErr(err)
                  tr.Done(ctx, err)
              }
          }()
      }
      or
      // Transfer the chunk
      wg.Add(1)
      o.fs.uploadToken.Get()
      go func(part int, position int64) {
          defer wg.Done()
          defer o.fs.uploadToken.Put()
          fs.Debugf(o, "Uploading part %d/%d offset %v/%v part size %v", part+1, session.TotalParts, fs.SizeSuffix(position), fs.SizeSuffix(size), fs.SizeSuffix(chunkSize))
          partResponse, err := o.uploadPart(ctx, session.ID, position, size, buf, wrap, options...)
          if err != nil {
              err = fmt.Errorf("multipart upload failed to upload part: %w", err)
              select {
              case errs <- err:
              default:
              }
              return
          }
          parts[part] = partResponse.Part
      }(part, position)
      for inspiration.
    • A major annoying limitation is that rclone provides us with an io.Reader instead of an io.ReadSeeker. So we need to fetch the chunks' data in order, and only then start the parallel jobs. Since the io.Reader may come from a remote, fetching the chunks in order and storing them temporarily would lower the speed (we can't download the chunks in parallel). I'm not sure if there is a good reason rclone doesn't provide an io.ReadSeeker. Maybe it's just a matter of modifying rclone's core code to make it provide an io.ReadSeeker without major modifications.
  • [In a follow-up PR] Implement fs.CleanUp() to purge all the temporary upload folders.
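
A rough sketch of the polling idea from the 504 item above; checkMerged is a placeholder for whatever the backend uses (e.g. a PROPFIND on the destination) to decide that the merge completed, and none of this is this PR's actual code:

    import (
        "context"
        "fmt"
        "time"
    )

    // waitForMerge polls until the server-side chunk merge has completed or the timeout expires.
    func waitForMerge(ctx context.Context, checkMerged func(context.Context) (bool, error), timeout, interval time.Duration) error {
        ctx, cancel := context.WithTimeout(ctx, timeout)
        defer cancel()
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            if done, err := checkMerged(ctx); err == nil && done {
                return nil // destination exists with the expected size: report success
            }
            select {
            case <-ctx.Done():
                return fmt.Errorf("timed out waiting for chunk merge: %w", ctx.Err())
            case <-ticker.C:
                // 423 Locked (or a still-missing destination) means the server is still merging: keep polling.
            }
        }
    }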

What is the purpose of this change?

Enable chunked uploads for Nextcloud (WebDAV).
See https://docs.nextcloud.com/server/latest/developer_manual/client_apis/WebDAV/chunking.html
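
For readers unfamiliar with that API, here is a rough sketch of the request sequence described in the linked documentation; the URLs, chunk naming and the do() helper are illustrative only, and authentication is omitted:

    import (
        "context"
        "fmt"
        "io"
        "net/http"
    )

    // chunkedUpload sketches the Nextcloud chunked-upload flow: MKCOL an upload
    // directory, PUT the chunks, then MOVE the virtual ".file" onto the destination.
    func chunkedUpload(ctx context.Context, client *http.Client, baseURL, user, uploadID, dst string, chunks []io.Reader) error {
        uploadDir := fmt.Sprintf("%s/remote.php/dav/uploads/%s/%s", baseURL, user, uploadID)
        if err := do(ctx, client, "MKCOL", uploadDir, nil, nil); err != nil {
            return err
        }
        for i, chunk := range chunks {
            // Chunk names only need to sort in upload order.
            if err := do(ctx, client, "PUT", fmt.Sprintf("%s/%015d", uploadDir, i+1), chunk, nil); err != nil {
                return err
            }
        }
        // The MOVE of ".file" triggers the server-side assembly of the chunks.
        dstURL := fmt.Sprintf("%s/remote.php/dav/files/%s/%s", baseURL, user, dst)
        return do(ctx, client, "MOVE", uploadDir+"/.file", nil, map[string]string{"Destination": dstURL})
    }

    // do is a tiny helper for this sketch only.
    func do(ctx context.Context, client *http.Client, method, url string, body io.Reader, hdrs map[string]string) error {
        req, err := http.NewRequestWithContext(ctx, method, url, body)
        if err != nil {
            return err
        }
        for k, v := range hdrs {
            req.Header.Set(k, v)
        }
        resp, err := client.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode >= 300 {
            return fmt.Errorf("%s %s: %s", method, url, resp.Status)
        }
        return nil
    }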

Was the change discussed in an issue or in the forum before?

#3666
#4865

Checklist

  • I have read the contribution guidelines.
  • I have added tests for all changes in this PR if appropriate.
  • I have added documentation for the changes if appropriate.
  • All commit messages are in house style.
  • I'm done, this Pull Request is ready for review :-)

@devnoname120 devnoname120 marked this pull request as draft Apr 28, 2022

@devnoname120 devnoname120 commented May 1, 2022

All chunked uploads are stuck at 100% now that I've fixed the retry mechanisms. I'll figure out the issue here.


@alkenerly alkenerly commented May 3, 2022

Hitting an issue when copying a file to my Nextcloud with this branch. Logs are pasted below.

2022-05-02 20:12:24 DEBUG : pacer: low level retry 10/10 (error File with name //GOPR0620.MP4 could not be located: Sabre\DAV\Exception\NotFound: 404 Not Found)
2022-05-02 20:12:24 ERROR : Attempt 1/3 failed with 1 errors and: read metadata failed: File with name //GOPR0620.MP4 could not be located: Sabre\DAV\Exception\NotFound: 404 Not Found

The issue seems to be that readMetaDataForPath in webdav.go used to rely on getting a 404 error code to indicate that nothing existed at the given path. But now that 404 is listed in the retry codes, it ultimately fails retrieving the metadata. I "fixed" this issue by removing the 404 from the retry codes, but eventually encountered the problem where Nextcloud returns a 404 when trying to finalize the chunk upload, and rclone decides to reupload the entire file instead of retrying.
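
One way to express that distinction (purely illustrative, not necessarily how this PR ends up resolving it) is to keep separate retry predicates, so PROPFIND metadata lookups still treat 404 as definitive while only the chunk-merge MOVE retries on it:

    import "net/http"

    // Hypothetical sketch: separate retry policies for normal requests and the chunk-merge MOVE.
    var (
        baseRetryCodes  = []int{429, 500, 502, 503, 504, 509}
        mergeRetryCodes = append([]int{404, 423}, baseRetryCodes...) // the merge may 404/423 transiently
    )

    func shouldRetry(resp *http.Response, err error) bool           { return retryFor(baseRetryCodes, resp, err) }
    func shouldRetryChunkMerge(resp *http.Response, err error) bool { return retryFor(mergeRetryCodes, resp, err) }

    func retryFor(codes []int, resp *http.Response, err error) bool {
        if resp != nil { // resp can be nil when err is a transport-level failure
            for _, c := range codes {
                if resp.StatusCode == c {
                    return true
                }
            }
        }
        return err != nil
    }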

If this isn't the correct place for these comments, definitely let me know.

Thanks for your work on this feature. It'll really help me out!


@devnoname120 devnoname120 commented May 3, 2022

@alkenerly This is the right place. Thanks a lot for testing, I'll look into it.

backend/webdav/webdav.go: review comment (outdated, resolved)

@devnoname120 devnoname120 commented May 11, 2022

@alkenerly Can you retry with the latest changes?


@devnoname120 devnoname120 commented May 12, 2022

It sometimes crashes here, not sure why (chunk size: 100 MB). Will need to investigate.

Note that both captures/file1.avi and captures/file5.avi were transferred successfully if I trust the file sizes (didn't verify the checksums).

2022-05-11 18:31:15 DEBUG : webdav root 'captures_dv_hi8': Waiting for transfers to finish
2022-05-11 20:29:46 DEBUG : captures/file5.avi: Cancelling chunked upload
Transferred:   	  255.318 GiB / 345.650 GiB, 74%, 34.860 MiB/s, ETA 44m13s
Checks:              1074 / 1074, 100%
Transferred:            0 / 9, 0%
Elapsed time:   1h58m36.7s
Transferring:
 *  captures/file1.avi:100% /30.992Gi, 2.207Mi/s, 0s
 * captures/file2.avi: 95% /32.003Gi, 4.562Mi/s, 5m0s
 * captures/file3.avi: 68% /46.601Gi, 5.659Mi/s, 44m22s
 * captures/file4.avi: 74% /42.622Gi, 6.148Mi/s, 30m17s
 * captures/file5.avi:100% /32.546Gi, 0/s, 0s
 *        captures/file6.avi: 66% /48.174Gi, 4.836Mi/s, 56m37s
 *        captures/file7.avi: 69% /49.428Gi, 5.798Mi/s, 43m44s
 *        captures/file8.avi: 94% /32.462Gi, 5.747Mi/s, 4m52s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x10 pc=0x1035eb14c]

goroutine 73 [running]:
github.com/rclone/rclone/backend/webdav.(*Fs).shouldRetryChunkMerge(0x140002f21e0?, {0x10414fa90?, 0x140000633c0?}, 0x1400be11400?, {0x104141358?, 0x14000c22540?})
	/Users/paul/dev/rclone2/backend/webdav/webdav.go:249 +0x2c
github.com/rclone/rclone/backend/webdav.(*Object).updateChunked.func4()
	/Users/paul/dev/rclone2/backend/webdav/webdav.go:1518 +0xd4
github.com/rclone/rclone/fs.pacerInvoker(0x1400004e120?, 0x1025ba8b0?, 0x140007cedc0?)
	/Users/paul/dev/rclone2/fs/pacer.go:88 +0x30
github.com/rclone/rclone/lib/pacer.(*Pacer).call(0x1400004e120, 0x140007ec081?, 0xa)
	/Users/paul/dev/rclone2/lib/pacer/pacer.go:197 +0x84
github.com/rclone/rclone/lib/pacer.(*Pacer).Call(0x1400004e120, 0x14000c22b10?)
	/Users/paul/dev/rclone2/lib/pacer/pacer.go:216 +0x108
github.com/rclone/rclone/backend/webdav.(*Object).updateChunked(0x1400075d4a0, {0x10414fa90?, 0x140000633c0}, {0x10413f158?, 0x140007a6000}, {0x12d6b4008, 0x14000335b60}, {0x140004aab40, 0x1, 0x1})
	/Users/paul/dev/rclone2/backend/webdav/webdav.go:1516 +0xea8
github.com/rclone/rclone/backend/webdav.(*Object).Update(0x1400075d4a0, {0x10414fa90, 0x140000633c0}, {0x10413f158, 0x140007a6000}, {0x12d6b4008?, 0x14000335b60?}, {0x140004aab40, 0x1, 0x1})
	/Users/paul/dev/rclone2/backend/webdav/webdav.go:1353 +0x134
github.com/rclone/rclone/backend/webdav.(*Fs).Put(0x14000a02120, {0x10414fa90, 0x140000633c0}, {0x10413f158, 0x140007a6000}, {0x12d6b4008?, 0x14000335b60?}, {0x140004aab40, 0x1, 0x1})
	/Users/paul/dev/rclone2/backend/webdav/webdav.go:853 +0x158
github.com/rclone/rclone/fs/operations.Copy({0x10414fa90, 0x140000633c0}, {0x104160388, 0x14000a02120}, {0x0?, 0x0?}, {0x14000121180, 0x31}, {0x1041603f8?, 0x14000335b60})
	/Users/paul/dev/rclone2/fs/operations/operations.go:472 +0x1320
github.com/rclone/rclone/fs/sync.(*syncCopyMove).pairCopyOrMove(0x1400057e000, {0x10414fa90, 0x140000633c0}, 0x0?, {0x104160388, 0x14000a02120}, 0x0?, 0x0?)
	/Users/paul/dev/rclone2/fs/sync/sync.go:403 +0x12c
created by github.com/rclone/rclone/fs/sync.(*syncCopyMove).startTransfers
	/Users/paul/dev/rclone2/fs/sync/sync.go:430 +0x44


@alkenerly alkenerly commented May 12, 2022

Running some long-running transfers now. I am moving approx. 4 GB files over a 20 Mbit connection, so not only are the files large, my transfers will also stress the timeouts of both servers.

Tentatively: it works! With one exception.

Sometimes, the transfer works exactly how you would expect it to!

Other times, it gets a 423 Locked when finalizing chunks
"user/files/path/to/file.upload.part" is locked: OCA\DAV\Connector\Sabre\Exception\FileLocked: 423 Locked.

That error repeats itself 10 times in quick succession (with the pacer rate limit never causing sleep longer than 2s). After 10 of those errors it cancels the chunked upload. However, when it attempts to restart the upload, it seems to figure out that the file was indeed uploaded correctly. It would appear that rclone successfully uploaded the file and nextcloud successfully assembled the chunks.

Interestingly, the first low-level retry is not triggered by a 423 file locked exception, but by a 504 gateway timeout.

2022-05-12 18:50:17 DEBUG : pacer: low level retry 1/10 (error <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>: 504 Gateway Timeout)

I bet the first request to merge the chunks was successfully received, but for whatever reason I got a 504. That causes the subsequent requests to report that the files are locked (which makes sense because Nextcloud is performing the merge operation on them).

Maybe there isn't a practical way around this except for each user hunting down every kind of timeout that exists on their webserver / reverse proxies etc. If only there were a "hey nextcloud, are you currently merging chunks from this directory right now?" API...

Do you have any ideas?


@devnoname120 devnoname120 commented May 13, 2022

@alkenerly Could you open an issue on NextCloud's repo? It may be possible to do a workaround but that would be super brittle and/or slow.


@alkenerly alkenerly commented May 13, 2022

@devnoname120 Just so I can clarify the issue before opening an issue over there:

I would think that a 504 is returned when there is a request that takes a long time to respond to. The request that is taking so long in this instance is probably the HTTP MOVE command that assembles the chunks.

There would not be any other long outstanding requests other than that MOVE, right? It is weird because that MOVE command probably takes far less time than uploading one chunk's worth of data at my slow data rate.


@juliushaertl juliushaertl commented May 13, 2022

I'll jump in here right away as I've seen your development happening, being a Nextcloud dev 😉 Really cool to see active work on the chunked upload.

The gateway timeout is usually something that can only be solved in the server configuration, where the frontend webserver or load balancer has too short a timeout set for requests. On the PHP side those requests do not get cancelled until the PHP maximum execution time is reached (which is 60 minutes by default with Nextcloud: https://github.com/nextcloud/server/blob/f110ff9f4d54c4124483f3ba0e6c681362b79a50/lib/base.php#L624).

From the above description it seems like the following is happening:

  • The first MOVE request gets executed
  • The webserver request to the PHP backend times out
  • The PHP process performing the MOVE is still active, therefore the file is still locked
  • The next attempt to MOVE will fail as there is still one ongoing

The general recommendation would be to adjust the server configuration to increase the timeout.


@juliushaertl juliushaertl commented May 13, 2022

There would not be any other long outstanding requests other than that, MOVE, right? It is weird because that MOVE command probably takes way less outstanding time than uploading one chunk's worth of data at my slow data rate.

The PUT requests are a bit different here as data is being transferred during that time, so frontend webservers usually keep the connection open in that case, while for the MOVE the connection between the client (rclone) and the webserver is basically idle as it just waits for the PHP operation to finish and return an HTTP response.


@alkenerly alkenerly commented May 13, 2022

@juliushaertl

Thanks for jumping in! I am uploading another set of chunks now but am pausing rclone's execution before it issues the MOVE command. My plan is to issue the last MOVE command myself probably using curl so that I can get a rough time until I get the 504.

I am running my nextcloud behind an nginx reverse proxy. So it will be a little scavenger hunt to determine which timeout I need to increase in order to get around this issue.

All that being said, I haven't personally found an issue with this branch (because this issue I have is a server configuration issue). Great work @devnoname120


@devnoname120 devnoname120 commented May 13, 2022

@juliushaertl Would there be a way to check the state of the MOVE operation? It could maybe return an operation id that I could then poll?

Some workarounds I can do on my side:

  • I could rely on the 'locked' status, but I've seen (hopefully resolved?) issues on Nextcloud's issue tracker where it didn't properly unlock files sometimes. So it's not reliable enough for rclone to use (even if those bugs are solved for now, it may break again in the future for other reasons). Right now rclone assumes that the upload failed and issues a DELETE request (which is really terrible because it deletes the file even if it was just being updated, which causes data loss; that's independent of my changes though, so it's for another PR).

  • Another strategy would be to poll the file status/date/hash for 30 seconds (then give up). But again it would be pretty annoying when updating a file, because I would need to figure out whether the upload got corrupted or the MOVE operation is still running.


@alkenerly alkenerly commented May 14, 2022

I have increased the relevant timeouts in my environment to prevent getting 504s on chunk assembly. I'll leave this branch transferring files of various sizes for a bit and will report any issues I hit.

Thanks again for the work you put in


@alkenerly alkenerly commented May 16, 2022

Transferred a mix of chunkable and non-chunkable files over the last few days and have had no issues.


@devnoname120 devnoname120 commented May 16, 2022

Transferred a mix of chunkable and non-chunkable files over the last few days and have had no issues.

Thanks a lot! Since then I implemented parallel chunk upload (see https://github.com/devnoname120/rclone/tree/feat-webdav-nextcloud-parallel-chunk-upload). There are two issues remaining:

  • Stopping the upload doesn't correctly purge the upload folder.
  • The progress shows the progress of chunk download instead of chunk upload.

So that will be for a later pull request. I think this pull request is functional enough that it makes sense to keep it as is.

I'll add some tests and clean up the code a bit, then I'll ask for reviews from the maintainers.


@devnoname120 devnoname120 commented May 16, 2022

@alkenerly Can you confirm that it still works with my latest changes? There should be no difference in behavior, I just made the code cleaner.


@juliushaertl juliushaertl commented May 17, 2022

Some workarounds I can do on my side:

  • I could rely on the 'locked' status, but I've seen (hopefully resolved?) issues on Nextcloud's issue tracker where it didn't properly unlock files sometimes. So it's not reliable enough for rclone to use (even if those bugs are solved for now, it may break again in the future for other reasons). Right now rclone assumes that the upload failed and issues a DELETE request (which is really terrible because it deletes the file even if it was just being updated, which causes data loss; that's independent of my changes though, so it's for another PR).

There may always be scenarios where file locks stay around for longer, e.g. if the PHP process gets killed for some reason. However, those should be cleaned up by a background job (if the database is used as the locking backend) or by their TTL if Redis is used as the locking backend. The TTL has a default of 1 hour, but is also configurable in Nextcloud's config.php (filelocking.ttl).

However I agree that it is not the ideal way to monitor if a move is still pending.

Would there be a way to check the state of the MOVE operation? It could maybe return an operation id that I could then poll?

Currently there is not really such a thing as a dedicated status check for the operation. During the move operation you would, as mentioned, receive a 423 Locked status code; once it is finished the temporary upload folder gets removed and then you should get a 404, as the source folder for the move is no longer found.


@devnoname120 devnoname120 commented May 17, 2022

@juliushaertl Thank you for your reply. It would be great if this feature could be added to Nextcloud.
In the meantime I'll mark these transfers as "failed" but without cleaning them up. This will make sure that users won't get stuck with files that can never transfer.


@devnoname120 devnoname120 commented May 18, 2022

@ncw Could you do a first review pass on this pull request? I'll then add tests. Thanks!

@devnoname120 devnoname120 marked this pull request as ready for review May 18, 2022

@ncw ncw left a comment

I think this is looking nice :-)

I put a comment inline.

Looking at the commits, this needs a lot of tidying up!

I think we should have one commit each for the preliminary work, i.e.

  • lib/rest
  • fstests

I don't think cmd/cmount should be here.

I'm not sure how many commits the backend/webdav code should be.

I know it's been through a few different authors, so if you squash commits with different authors then add a Co-authored-by: Name1 Name2 <[email protected]> line at the end of the commit after a blank line; that will keep the authorship straight.

I don't know how experienced a git user you are, so if you'd like me to do the squashing I can.

I'll run the CI now too and we can see if that shows up anything else.

Thank you :-)

cmd/cmount/mountpoint_windows.go: review comment (outdated, resolved)
@devnoname120 devnoname120 force-pushed the feat-webdav-nextcloud-chunked-v2 branch 2 times, most recently from 1a68409 to 96e58d9 on May 20, 2022
@devnoname120 devnoname120 force-pushed the feat-webdav-nextcloud-chunked-v2 branch from 96e58d9 to 2783ac1 on May 20, 2022

@devnoname120 devnoname120 commented May 20, 2022

@ncw Thank you for your comments. I addressed the changes + a lint formatting issue.
Could you run the CI again?

@devnoname120 devnoname120 force-pushed the feat-webdav-nextcloud-chunked-v2 branch from 2783ac1 to 6b0c8ac on May 20, 2022

@ncw ncw commented May 20, 2022

@ncw Thank you for your comments. I addressed the changes + a lint formatting issue. Could you run the CI again?

Done!

Are you ready for another review?


@devnoname120 devnoname120 commented May 20, 2022

Are you ready for another review?

Yes!

My main concern is to test low-level HTTP/2 retries. I looked into how to make a mock so as to simulate a GOAWAY, but it looks complicated and I'm not familiar enough with Go (this PR is my "hello world" project; I hadn't written any Go before).
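
For what it's worth, a rough stand-in (over plain HTTP/1.1, not a real HTTP/2 GOAWAY) that drops the connection on the first attempt so the retry path gets exercised; the test name and the final assertion are placeholders:

    import (
        "net/http"
        "net/http/httptest"
        "sync/atomic"
        "testing"
    )

    func TestRetryOnDroppedConnection(t *testing.T) {
        var calls int32
        srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if atomic.AddInt32(&calls, 1) == 1 {
                // First attempt: hijack the connection and close it without answering.
                if hj, ok := w.(http.Hijacker); ok {
                    if conn, _, err := hj.Hijack(); err == nil {
                        conn.Close()
                    }
                }
                return
            }
            w.WriteHeader(http.StatusCreated) // later attempts succeed
        }))
        defer srv.Close()
        // ...point a chunked upload at srv.URL and assert it succeeds after exactly one retry.
    }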

Would you mind taking over this part?

Thank you.

@devnoname120 devnoname120 marked this pull request as draft May 26, 2022

@makefu makefu commented May 27, 2022

Hey, I would love to see this feature upstream, as right now I have some bad issues when trying to upload files larger than 50 MB to Nextcloud!
