Skip to content

[CR] waterbutler provider for Microsoft OneDrive#102

Closed
caseyrygt wants to merge 60 commits intoCenterForOpenScience:developfrom
caseyrygt:feature/onedrive-provider
Closed

[CR] waterbutler provider for Microsoft OneDrive#102
caseyrygt wants to merge 60 commits intoCenterForOpenScience:developfrom
caseyrygt:feature/onedrive-provider

Conversation

@caseyrygt
Copy link
Contributor

Initial Work In Progress PR for Microsoft OneDrive WaterButler Provider

Notes

We copied Dropbox as the starting point for this provider. There is still some functionality in comments from Dropbox and that is not specific to OneDrive. Likewise there are debug and info logging statements that we'll remove before the final PR.

Also, invoke test passes but there are no OneDrive unit tests, yet. When I was first working in this, I just got formatting errors from invoke test and then once I got all of those fixed I saw that there are in fact WaterButler tests.

I will get tests written for OneDrive.

Design Approach

I think the biggest difference between the Dropbox and OneDrive provider (as written), is I used the OneDrive "id" for the file/folder rather than the path and name.

Limitations

  • This only works with OneDrive for personal use. OneDrive for business has a different authentication endpoints and the API is slightlightly different at this time. Additionally, Microsoft has released graph which, I believe, should (when it is complete) unify OneDrive for Personal and OneDrive for Business under one API. The OneDrive API used here should be compatible with Graph based on the documentation I've seen.
  • OneDrive's API only returns the most recent revision for an item (file or folder). I don't believe there is a way around this now.

Issues under development

  • Delete Folder
  • Download Zip
  • Rename File
  • Rename Folder
  • Copy Folder
  • Copy File

Functionality in Place (i.e. "dev complete")

  • Upload files
  • Create folders
  • Download Files
  • Delete files
  • View Revisions
  • List Files/Folders in a folder

Questions

  • OneDrive's copy function only supports async calls. To get the status of a copy request, you have to call to another API after submitting the Copy request. I think the work flow here would be:
    • Submit Copy request to OneDrive (intra_move, intra_copy)
    • Poll OneDrive for Copy Job status
    • Notify OSF that Copy job is complete.
  • Does OSF and WaterButler have any support for something like this? Or is there another way to make the user aware that there is a background process running?
  • When copying a file from one folder to another in OSF, I believe the intra_move method was actually called in WaterButler. Is there a reason intra_copy isn't called? (This may be something I need to change but nothing jumped out at me.)
  • Are there examples of working with WaterButlerPath? Sometimes I just need the file name or folder name and str(path) was the only way I could get it, but that doesn't seem like the cleanest approach.

Thanks!

…orks with a single file in the project root folder; todo: test upload to subfolder
@felliott
Copy link
Member

felliott commented Dec 3, 2015

Hey @caseyrygt,

Re: async: Yes, WB can do this. Look at waterbutler/tasks/core.py for helper code.

Re: intra_move vs. intra_copy: Where are you seeing this? Is this in v0 or v1?

Re: WaterButlerPath: No, sorry, str(path) is pretty standard, and we don't have any really good examples. The Box.com provider also uses an id rather than folder/name, so the WBPath-using code in there might prove helpful.

Cheers,
Fitz

@caseyrygt
Copy link
Contributor Author

Hey @felliott:

RE async: thanks, I'll take a look there.

RE intra_move vs Intra_copy: I just tried to reproduce it and I must have been confused or thinking I did a copy instead of a move. Sorry about that.

I'll also look at Box for the path handling.

Thanks,
-Ryan

@caseyrygt
Copy link
Contributor Author

@felliott,

I've run into an interesting snag...when creating the path parts for WaterButlerPath, I'm creating the names from the parentReference path: /ryan-test1/sub1/sub2; the ids are created from the parent ID and child ID.

So I have something like this:
names = "/ryan-test1/sub1/sub2/foobar.txt"
ids = ["1234!1", "1234!2"]

In this case when methods call path.idenfifier (e.g. download, metadata), path.identifier is none. I suspect it is because WBPath thinks there are 4 parts in total (3 folders, 1 file) and only 2 IDs were passed in.

If that's correct, I could work around it by just using the parent folder path (/sub2) and its ID.

Is my understanding correct and is there a better way to create a WaterButlerPath for OneDrive?

Thanks!

@felliott
Copy link
Member

I've run into an interesting snag...when creating the path parts for WaterButlerPath, I'm creating the names from the parentReference path: /ryan-test1/sub1/sub2; the ids are created from the parent ID and child ID.

So I have something like this:
names = "/ryan-test1/sub1/sub2/foobar.txt"
ids = ["1234!1", "1234!2"]

In this case when methods call path.idenfifier (e.g. download, metadata), path.identifier is none. I suspect it is because WBPath thinks there are 4 parts in total (3 folders, 1 file) and only 2 IDs were passed in.

If that's correct, I could work around it by just using the parent folder path (/sub2) and its ID.

Is my understanding correct and is there a better way to create a WaterButlerPath for OneDrive?

IIRC, we usually just pad the id list with None for the entries we don’t have ids for. So in your example, ids would be [None, None, “1234!1”, “1234!2”].

@caseyrygt
Copy link
Contributor Author

@felliott,

Happy New Year!

I've got the name/id path code working for most use cases except downloading a folder as a zip file.

There I'm running into an issue in download where path.identifier is None. It looks like the Zip function in core/provider is calling revalidate_path. Can you shed some light on what this function needs to do?

I looked at the box revalidate_path method and it looks like it gets the name and id for a path but I'm not 100% sure if that is everything it needs to do.

Thanks,
-Ryan

@felliott
Copy link
Member

felliott commented Jan 4, 2016

I've got the name/id path code working for most use cases except downloading a folder as a zip file.

There I'm running into an issue in download where path.identifier is None. It looks like the Zip function in core/provider is calling revalidate_path. Can you shed some light on what this function needs to do?

I looked at the box revalidate_path method and it looks like it gets the name and id for a path but I'm not 100% sure if that is everything it needs to do.

Hey Ryan,

You are correct. revalidate_path takes a base path and a file/folder name and returns a WaterButler path object for the file/folder as located under the base path. For path-based providers, this is just calls child() on the base path. For id-based providers, like Box, we need to know the id, so they need to implement revalidate_path.

Cheers,
Fitz

@TomBaxter
Copy link
Member

REDACTED see next post for details

Hello @caseyrygt . Nice work on this new provider! I have been working on an issue with another provider and was curious how you were handling it in your code, so started snooping. It seems that you may run into the same problem. Both GoogleDrive and OneDrive File/Folder IDs can have multiple parents. The provider might need to handle that possibility. I have come to a bit of a dead end trying to resolve this in GoogleDrive. I hope you are able avoid this situation or find a solution. Food for thought @felliott .

Thanks, Tom

@felliott
Copy link
Member

felliott commented Jan 4, 2016

Hello @caseyrygt . Nice work on this new provider! I have been working on an issue with another provider and was curious how you were handling it in your code, so started snooping. It seems that you may run into the same problem. Both GoogleDrive and OneDrive File/Folder IDs can have multiple parents. The provider might need to handle that possibility. I have come to a bit of a dead end trying to resolve this in GoogleDrive. I hope you are able avoid this situation or find a solution. Food for thought @felliott .

wait what oh dear oh no please why help me

That’s a great catch, @TomBaxter! Could you send me a link to docs where it talks about multiple parents? My browse-fu is weak today.

@caseyrygt, the issue we’re running into with GDrive is that its folders are not really folders, but labels. This breaks WB’s expectations of filesystem-like behavior, where /foo/bar/baz.txt and /quux/baz.txt may have identical contents and names, but are in fact separate entities. With GDrive, each file may be distinct, or it may be the same file with the same ID in two places. If they are the same entity, deleting/updating /foo/bar/baz.txt will update /quux/baz.txt, too. None of the other providers do this, so it seems dicey to allow this for just GDrive (and now OneDrive). It also has some troubling consequences in that Read+Write contributors to a project could end up modifying files outside the linked folder, which may not be what the account owner wants.

@TomBaxter, @caseyrygt: I’ve talked with our Product Manager about what we should do for this, and I think we’ve come to a decision. I’m still in the middle of writing it up formally and probing for weaknesses, but what we want to do is this: WB is supposed to be a storage-provider abstraction providing filesystem-like semantics. When I delete /foo/bar/baz.txt, nothing should happen to /quux/baz.txt, even if they are the same underlying file. We can fake this for GDrive by making DELETE operations remove the folder label instead of deleting the file. Renames and updates can be faked by removing the parent label, copying from the the source file to a new file, and making modifications to the new file.

given /foo/baz.txt and /bar/baz.txt are the same file with ID 1234, rename /bar/baz.txt to /bar/quux.txt

remove label ‘bar’ from file 1234
make a distinct copy of 1234 (assume new id is 9876)
assign 9786 label ‘bar’
assign 9786 name ‘quux.txt’

Likewise, moves and copies will work on the same principal, with mutating actions creating new copies of multi-labeled entities. I’m still working through some examples on paper to make sure this’ll work how we expect, but I think it will so far. One other change will be in the ID assignments. Since we have to support the possibility of multi-labeled files, GDrive and OneDrive ids will have to be changed to materialized ids, a cross between paths and ids. Copying from another PR response, given:

/Pictures/ (id: 1A2B)
    /Ducks/ (id: 3C4D)
        platypus.jpg (id: 5E6F)
    /Beavers/ (id: 7G8H)
        platypus.jpg (id: 5E6F)

/Pictures/Ducks/platypus.jpg would have a materialized ID of /1A2B/3C4D/5E6F, while /Pictures/Beavers/platypus.jpg would have a materialized id of /1A2B/7G8H/5E6F.

For now, I’m still figuring out the cases we’ll need to support, and making sure this approach will work. I’ll update this ticket and the GDrive one when I’ve finished. Thank you, both of you!

Cheers,
Fitz

@caseyrygt
Copy link
Contributor Author

@TomBaxter,

That is very interesting. Thanks for bringing it up. Could you provide a link to the OneDrive (Personal Edition) documentation for labels / multiple parents?

I went through the API just now and couldn't find anything regarding labels. (API: https://dev.onedrive.com/README.htm)

Thanks,
-Ryan

@TomBaxter
Copy link
Member

@caseyrygt @felliott

I apologize for jumping the gun with my post. I thought I had found several references to multiple OneDrive parents yesterday but have been unable to find them this morning. I suspect that my google searches of OneDrive were returning hits for GoogleDrive.

I dove into the OneDrive API and see now that they use "remoteItem" to reference shares instead of using multiple "parentReference" entries and as near as I can tell they don't support symbolic links within a single "drive" at all, unless perhaps that is also done with the "remoteItem".

Again I apologise for any anxiety I caused by bringing this up prematurely.

Embarrassingly, Tom

@felliott
Copy link
Member

felliott commented Jan 5, 2016

I apologize for jumping the gun with my post. I thought I had found several references to multiple OneDrive parents yesterday but have been unable to find them this morning. I suspect that my google searches of OneDrive were returning hits for GoogleDrive.
I dove into the OneDrive API and see now that they use "remoteItem" to reference shares instead of using multiple "parentReference" entries and as near as I can tell they don't support symbolic links within a single "drive" at all, unless perhaps that is also done with the "remoteItem".
Again I apologise for any anxiety I caused by bringing this up prematurely.

Embarrassingly, Tom

No worries, @TomBaxter! That’s a relief to hear.

@caseyrygt: Sorry, I’ve lost track. Is this PR ready for review again?

Cheers,
Fitz

@caseyrygt
Copy link
Contributor Author

@TomBaxter, no worries! I'm glad you were thinking about other providers
as well.

On Tue, Jan 5, 2016 at 9:38 AM, Fitz Elliott notifications@github.com
wrote:

@caseyrygt: Sorry, I’ve lost track. Is this PR ready for review again?

@felliott, let me implement revalidate_path first. I did have a question
about the backgroundify method(s):

Can you take a look at intra_copy? Basically, I did not want to return to
OSF before OneDrive had finished copying the files using its async method.
So I wrote a loop to check up to 100 times if the file copy is complete.
It's not elegant, but I didn't see any example code of using the
backgroundify method and NOT returning control/flow to OSF. Does that make
sense?

If you want to look at intra_copy once everything else is done, I
understand.

Thanks,
-Ryan

@felliott
Copy link
Member

felliott commented Jan 5, 2016

@felliott, let me implement revalidate_path first.

Sounds good.

I did have a question about the backgroundify method(s):

Can you take a look at intra_copy? Basically, I did not want to return to
OSF before OneDrive had finished copying the files using its async method.
So I wrote a loop to check up to 100 times if the file copy is complete.
It's not elegant, but I didn't see any example code of using the
backgroundify method and NOT returning control/flow to OSF. Does that make
sense?

If you want to look at intra_copy once everything else is done, I
understand.

I don’t know off the top of my head, but I’ve made a note to look at it. I’m going to be spending all day writing stuff up, so it might be tomorrow or Thursday until I get back to you. Thanks!

Cheers,
Fitz

@caseyrygt
Copy link
Contributor Author

Since this has gotten long and I believe all of the issues here have been address, I'm going to close this and submit a new PR with a new question.

Thanks!

@caseyrygt caseyrygt closed this Feb 25, 2016
felliott added a commit that referenced this pull request Nov 13, 2017
 This feature was done as a COS-funded Integration Grant by:
  * Ryan Casey (@caseyrygt)

 Additional work was done on read-write support by Alexandr Melnikov
 (@alexandr-melnikov-dev-pro).  Read-write will be added in a future
 release.

 Thank you both for all your hard work on this!

 [SVCS-269] [SVCS-458]
 Closes: #205
 Related: #102, #151, #177
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants