Create ML-as-a-service.md #399

Closed
MikeMS-sys wants to merge 5 commits into w3f:master from uddugteam:master

Conversation

@MikeMS-sys
Contributor

Grant Application Checklist

  • The application template has been copied, renamed (project_name.md) and updated.
  • A BTC or Ethereum (DAI) address for the payment of the milestones is provided inside the application.
  • I have read and acknowledged the Terms and Conditions.
  • The software delivered for this grant will be released under an open-source license specified in the application.
  • The total funding amount of the project is below USD $30k for initial grant applications and $100k for follow-up grants.
  • The initial PR contains only one commit (squash if needed before submitting your PR).
  • The grant will only be announced once the first milestone has been accepted.

@MikeMS-sys MikeMS-sys closed this May 3, 2021
@MikeMS-sys MikeMS-sys reopened this May 3, 2021
@semuelle semuelle self-assigned this May 3, 2021
@semuelle
Contributor

semuelle commented May 3, 2021

Hi @MikeMS-sys, thank you for your application. We will look into it as soon as possible.

If you would be so kind as to review the specification of your deliverables so that every deliverable is measurable and verifiable, that would help us enormously with evaluation later on. We will update [...], for example, is not a deliverable.

Here are some more questions & suggestions based on a quick glance at your document:

  • Will the ML algorithms be implemented as offchain workers?
  • Either way, what's the point of bringing blockchain into this? If I distrust your model, I can train one locally with your data. If I distrust your data, I am not interested in your model.
  • Some models require hundreds of gigabytes of data, what's your use case/motivation for asking someone to upload their data so that some blockchain node can run an algorithm on it that's already implemented in Rust?

@MikeMS-sys
Contributor Author

MikeMS-sys commented May 7, 2021

Hi @semuelle, thank you for your suggestions.

We're looking to implement both offchain and onchain calculations, so that developers implementing machine learning in Substrate-based projects can choose between them.

  • Offchain workers are useful for implementing a separate ML service and for some offchain work with data and predictions (or other results).
  • Onchain workers are useful to provide full blockchain powered service when data or any data pointers (ipfs hashes, etc.) become available only in blockchain.

The pallet is suitable for projects where users need the power of a communal neural network while knowing their data is protected. Having worked extensively with blockchain technologies, our team found that both blockchain and machine learning are data-driven, and thus there is rapidly growing interest in integrating them for more secure and efficient data sharing and analysis.

We want to realise this idea as the core part of our project in the healthcare sphere, Trusted Health Council. Users there can securely share data with the blockchain and get predictions. The neural network training process becomes better with any new user data. Nobody knows who the data owners are, because all the data is anonymised.

Roadmap update

Milestone 1 - Proof of concept

  • Estimated Duration: 1.5 months
  • FTE: 3
  • Costs: $14 000
  1. Substrate ML pallet
     • Generate predictions based on the Random Forest algorithm
     • All data is stored on-chain
  2. Web application
     • Interacting with the blockchain
     • Form with fields to upload user data into the ML pallet
     • Handle the event with the prediction
  3. All code will have proper unit-test coverage to ensure functionality and robustness.
  4. Comprehensive quality assurance for all platform features.
  5. Docker image with a test Substrate chain with the integrated ML pallet, demonstrating its functionality.
  6. Documentation of the code and a basic tutorial describing how the software can be used and tested.

Milestone 2 - Production ready

  • Estimated Duration: 1.5 months
  • FTE: 3
  • Costs: $14 000
  1. Substrate ML pallet
     • Implement all ML algorithms from the SmartCore library
     • Integrate OrbitDB and allow data to be stored in IPFS
     • Data encryption module
     • Manage access to users' prediction results and provided data
  2. Web application
     • Functionality to select the current ML algorithm
     • Flag to encrypt user data
     • Access to IPFS data by hash
  3. All code will have proper unit-test coverage to ensure functionality and robustness.
  4. Comprehensive quality assurance for all platform features.
  5. Docker image with a new version of the test Substrate chain, demonstrating its functionality.
  6. Documentation of the code and a basic tutorial describing how the software can be used and tested.

@alxs
Contributor

alxs commented May 10, 2021

Hi @MikeMS-sys. I'm only quickly jumping in to post a link to your last application to the General Grants Program, for reference: w3f/General-Grants-Program#413.

Besides, could you please update the application itself? It would also be helpful if you could structure the deliverables tabularly as in the template, and include deliverables 0a-c in each milestone.

Lastly, I would also add that your deliverables and the application in general should still include far more details. You have barely updated them, whereas they need a complete overhaul. You may treat this as a contract; the level of detail must be enough to later verify that the software meets the specification. You can find some examples of what we're interested in for different grant categories here and have a look at this somewhat related application and its deliverables or any of the ones mentioned in the README for reference.

And could you specify what you mean by

Onchain workers [are] useful to provide full blockchain powered service when data or any data pointers (ipfs hashes, etc.) become available only in blockchain.

Both the data and the algorithms required for ML would be far too resource-intensive to be run on-chain, so what's your thinking behind this? Also, data referenced via e.g. an IPFS hash would be accessed via an off-chain worker and clearly cannot be retrieved on chain. Could you clarify what you mean?

@semuelle
Contributor

  • Pallet suitable for projects where users need the power of a communal neural network while knowing their data is protected.

Can you expand on that? A communal neural network is a model that anyone has access to, or is there more? If I'm worried about my data being protected, wouldn't I just build my own model or download it and run it locally?

  • Neural network education process become better with any new user data.

But models are usually trained with data that is verified and often selected from a small population slice.

  • Data encryption module

What is encrypted, and where? In the browser before upload? If I used someone else's model, wouldn't I want to have access to the data it was trained with? How do you re-train a model with two separately encrypted datasets?

@MikeMS-sys
Contributor Author

Dear @alxs and @semuelle,
Thank you once again for the comprehensive recommendations. We are working on the application and will update it for the next steps.

@semuelle semuelle added the "changes requested" label (The team needs to clarify a few things first.) May 17, 2021
@MikeMS-sys
Contributor Author

Returning to our conversation, @alxs and @semuelle: we have reflected on the earlier questions and have updated the application.

@semuelle The data encryption module was unfortunately included in the application by mistake.

@semuelle
Contributor

Hi @MikeMS-sys, thanks for the update. The repo containing the images (example) seems private though. I cannot access them.

@semuelle
Contributor

semuelle commented May 26, 2021

Thanks for the update. Do I understand correctly that I have to pass my training data to the node via transaction, which then stores it off-chain? Why? Why don't I store it on IPFS myself and then reference it via hash? That sounds like a massive bottleneck.

Neural network education process become better with any new user data.

Is there anything preventing people from polluting my model with wrong or fake data?
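
The alternative raised in this comment (keep the training data off-chain and put only a hash on-chain) can be illustrated with a minimal sketch. It is not the applicants' design and it does not use a real IPFS client: `OffChainStore`, `content_id` and the record layout are hypothetical names, and a plain hex SHA-256 digest stands in for an IPFS CID, which in reality uses a multihash encoding.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Stand-in for an IPFS CID: the hex SHA-256 digest of the data."""
    return hashlib.sha256(data).hexdigest()


class OffChainStore:
    """Hypothetical off-chain blob store keyed by content hash."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        cid = content_id(data)
        self._blobs[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        data = self._blobs[cid]
        # Re-hash on retrieval so tampered content is detected.
        if content_id(data) != cid:
            raise ValueError("stored data does not match its hash")
        return data


store = OffChainStore()
training_data = b"age,systolic_bp,label\n54,130,1\n37,118,0\n"
cid = store.put(training_data)

# Only this short reference would travel in the transaction and sit on-chain.
on_chain_record = {"submitter": "alice", "data_ref": cid}

fetched = store.get(on_chain_record["data_ref"])
```

The on-chain footprint is a fixed-size reference regardless of dataset size, and anyone who fetches the blob can check its integrity. The hash says nothing about whether the data is genuine, which is the pollution concern raised above.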

@burdges
Contributor

burdges commented May 28, 2021

Please provide github URLs for all team members. LinkedIn URLs have no value in demonstrating team member abilities.

Afaik, there is never much if any value in doing machine learning on a blockchain. There is no need for a public source of truth since by definition machine learning models extract features from statistical samples.

Instead, if one really needs secrecy, either services provide a proprietary obfuscated model directly to users, or users provide their own masked data to services. All this falls into the adversarial machine learning field, which evolves quite quickly these days.

It's pretty trivial to obtain a less biased sample population than blockchain users, but if one day blockchains become really widely used then it's plausible one wants cryptography like group/ring VRFs when sampling, but even then if one used blockchain accounts for sampling one never touches the blockchain itself, only proves account existence in zero-knowledge.

@MikeMS-sys
Contributor Author

@semuelle The data receiving process for machine learning algorithms here is our way to prevent spam attacks or fake data from intruders. Our transaction is based on a specific format, and users can only use this format for data transfer.

Future plans: implement a validation module.

I certainly agree with you, @burdges, but I am nevertheless sure that this idea has its place, especially for supported private projects or unique solutions.

Team web site https://uddug.com

Andrew Skurlatov (technical lead)
Github https://github.com/andskur

Nikita Velko (senior frontend developer)
Github https://github.com/nikichv

Ivan Podsebnev (devops engineer)
Github https://github.com/naykip

Constantine Czerniak (data scientist)
Github https://github.com/Snaaby

@semuelle semuelle added the "ready for review" label (The project is ready to be reviewed by the committee members.) and removed the "changes requested" label (The team needs to clarify a few things first.) May 31, 2021
@semuelle
Contributor

Data receiving process for machine learning algorithms here is our way to prevent spam attacks or fake data from intruders.

How this helps with spam I understand, but fake data? And who or what are intruders?

Contributor

@Noc2 Noc2 left a comment

Thanks for the application. I have a few questions: Are you aware of offchain::ipfs? How are you going to implement OrbitDb or the Data encryption module of your second milestone? Could you provide more details here?
Since your milestone 1 is mostly about Random Forest, could you also provide more information about this? For example: How do you ensure randomness? Is everything calculated on-chain for this (seems to be very computation heavy and it might become really difficult to benchmark this correctly)? Or do you put only a single specific random forest on-chain? From the application, it seems users have the option to update the algorithm. Isn't this like allowing people to upload their own smart contract? How do you want to integrate this?

@MikeMS-sys
Contributor Author

@semuelle
That was a mistake in our explanation. The data receiving process for machine learning algorithms here is our way to prevent spam attacks. Each transaction has a certain value, and the fee should largely deter the submission of fake data. We discussed implementing an anti-fraud algorithm and a validation module, but that would cost more resources, so for now we have decided to leave it in the project's future plans.
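
To make the mechanism described here concrete, the following is a rough sketch of a fixed submission format combined with a per-transaction fee. Everything in it is an assumption made for illustration: the field names, the value ranges, the fee amount, and the choice to charge the fee even when validation fails are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass

# Hypothetical schema: field name -> (type, min, max)
SCHEMA = {
    "age": (int, 0, 120),
    "systolic_bp": (int, 50, 250),
}
SUBMISSION_FEE = 5  # hypothetical fee in arbitrary units


@dataclass
class Account:
    balance: int


def validate(record: dict) -> bool:
    """Accept only records with exactly the schema's fields and ranges."""
    if set(record) != set(SCHEMA):
        return False
    for name, (kind, low, high) in SCHEMA.items():
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, kind):
            return False
        if not low <= value <= high:
            return False
    return True


def submit(account: Account, record: dict, accepted: list) -> bool:
    """Charge the fee, then store the record if it matches the format."""
    if account.balance < SUBMISSION_FEE:
        return False
    account.balance -= SUBMISSION_FEE  # assumed: charged even if rejected
    if not validate(record):
        return False
    accepted.append(record)
    return True


alice = Account(balance=15)
accepted = []
ok_real = submit(alice, {"age": 54, "systolic_bp": 130}, accepted)
ok_malformed = submit(alice, {"age": "old"}, accepted)
ok_fabricated = submit(alice, {"age": 54, "systolic_bp": 250}, accepted)
ok_unfunded = submit(alice, {"age": 30, "systolic_bp": 120}, accepted)
```

In this sketch the fee bounds how many submissions an account can make and the format check rejects malformed records, but a fabricated record that fits the schema is still accepted. Catching that case would need something like the validation module that is deferred to future plans.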

@Noc2
In the 1st milestone we will implement a basic random forest regression using the SmartCore library (https://smartcorelib.org/user_guide/supervised.html) with 100 independent trees, and add further algorithms in the second milestone.

In the 2nd milestone we are planning to integrate orbit-db via the offchain::ipfs pallet to implement a complex data storage solution in IPFS. The data encryption module was unfortunately included in the application by mistake; we discussed this with Semuelle but forgot to delete it.

Yes, users have the option to update the algorithm and upload their own smart contract to the chain.

On-chain calculations are interesting but do entail very heavy computation. We plan to research whether this is expedient in principle, and to analyse the concept of production-ready on-chain calculation, perhaps on side-chains, in the future.
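
For readers unfamiliar with the technique, here is a small self-contained sketch of random forest regression with 100 independently trained trees. It does not use SmartCore and is not the applicants' implementation: it uses depth-1 trees (stumps) on a single feature, and the function names are hypothetical. The fixed seed is one assumed answer to the randomness question raised earlier, since every node running the same code on the same data would then build the same forest.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, one mean per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        # Sum of squared errors when predicting each side's mean.
        sse = sum((y - left_mean) ** 2 for y in left) + sum(
            (y - right_mean) ** 2 for y in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant
        overall = mean(ys)
        return (float("inf"), overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=100, seed=0):
    """Train each tree on its own bootstrap sample (bagging)."""
    rng = random.Random(seed)  # fixed seed makes the forest reproducible
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return mean(predict_stump(stump, x) for stump in forest)


# Toy data: a step from 0 to 10 at x = 10.
xs = [float(i) for i in range(20)]
ys = [0.0 if x < 10 else 10.0 for x in xs]
forest = fit_forest(xs, ys)
low = predict_forest(forest, 2.0)
high = predict_forest(forest, 17.0)
```

Even this toy version refits 100 trees over the whole dataset, which gives a feel for why the reviewers question running training on every node.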

Contributor

@Noc2 Noc2 left a comment

Thanks for the response. I have a few follow-up questions:

  • So to be honest, I still don’t fully understand the benefit of putting the algorithm on-chain. If you only care about spam attacks, then there are a lot of other ways to deal with it. Putting the computation and data on-chain means instead of one computer computing and storing everything suddenly a lot of computers need to do it. Which seems highly inefficient.
  • Your example here is a little bit scary to be honest. Putting personal health data on-chain isn’t something that anyone wants (except maybe insurance companies ;-)) and there are a lot of legal problems to overcome. If you generally want to focus on health data, I recommend focusing first on the encryption/privacy part and later on everything else.
  • The orbit-db via offchain::ipfs pallet implementation sounds interesting to me. Could you integrate more details about this into the application? This on its own might be interesting for a lot of projects and something we might want to fund.

@MikeMS-sys
Contributor Author

  1. It seems to be much more inefficient than off-chain, of course. But it might be helpful for some private chains, which could use it for testing purposes and deploy a ready-made ML blockchain without any extra third-party dependencies. Also, we see potential for the future: we mean Skynet =)

Having initially started with the General Grants Program, we found inconsistencies with the provisions of the European GDPR, in particular with "the right to be forgotten". Among the current solutions, we mainly encountered hypotheses based on smart contracts (on-chain), which promise heavy computation.

  2. It's just the simplest prototype to test some basic hypotheses. In the Trusted Health Council (THC) project, one of the main focuses is data encryption/anonymity. Nobody except the data owner knows who the owner is.

  3. Yes, we can. Probably the better way is to create a new application? In this ML pallet we plan to use the database in the simplest way: only CRUD operations with simple requirements. But in the THC project one milestone (and the pallet built on it) is about an IPFS-based distributed database, and we can move all related work to a different pallet.

@Noc2
Contributor

Noc2 commented Jun 14, 2021

Thanks for the quick reply. How about we close this application and you initially apply for an orbit-db pallet or something similar? This might be easier to approve and generally something the grants committee is interested in. It might also help us to get a better understanding of your current work.

@MikeMS-sys
Contributor Author

@Noc2 We agree, please close this application.

@Noc2
Contributor

Noc2 commented Jun 16, 2021

Thanks for the update.

@Noc2 Noc2 closed this Jun 16, 2021
alxs pushed a commit that referenced this pull request Jul 20, 2021
Update milestone-deliverables-guidelines.md

Labels

ready for review The project is ready to be reviewed by the committee members.

5 participants