Significant advances are being made in artificial intelligence, but accessing and taking advantage of the machine learning systems enabling these developments can be challenging, especially for those with limited resources. These systems tend to be highly centralized, their predictions are often sold on a per-query basis,1 and the datasets required to train them are generally proprietary and expensive to create on their own. Additionally, published models run the risk of becoming outdated if new data is not regularly provided to retrain them.
We envision a different paradigm, one in which people will be able to easily and cost-effectively run machine learning models with technology they already have, such as browsers and apps on their phones and other devices. In the spirit of democratizing AI, we have proposed Decentralized & Collaborative AI on Blockchain.2
Through this new framework, participants can collaboratively and continually train and maintain models, as well as build datasets, on public blockchains, where models are generally free to use for evaluating predictions. The framework is ideal for AI-assisted scenarios people encounter daily, such as interacting with personal assistants, playing games, or using recommender systems. An open-source implementation for the Ethereum blockchain is available on GitHub,3 and our paper “Decentralized & Collaborative AI on Blockchain”4 (co-authored by Prof. Bo Waggoner,5 a postdoc researcher with Microsoft at the time of the work, and me) was presented at the second IEEE International Conference on Blockchain in July 2019.
Leveraging blockchain technology allows us to do two things that are integral to the success of the framework and its governance: offer participants a level of trust6 and security, and reliably execute an incentive-based system that encourages participants to contribute data that will help improve a model’s performance.
With current web services, even if code is open source, people cannot be 100 percent sure of what they are interacting with, and running the models generally requires specialized cloud services.7 In our solution, we put these public models into smart contracts,8 code on a blockchain that helps ensure the specifications of agreed-upon terms are upheld. In this framework, models can be updated on-chain (meaning within the blockchain environment) for a small transaction fee, or used for inference off-chain, locally on the individual’s device, with no transaction costs.
Smart contracts are unmodifiable and evaluated by many machines, helping to ensure the model does what it specifies it will do. The immutable nature and permanent record of smart contracts also allow us to reliably compute and deliver rewards for good data contributions. Trust is important when processing payments, especially in a system like ours that seeks to encourage positive participation via incentives (more on this below). Additionally, blockchains such as Ethereum have thousands of decentralized machines all over the world,9 making it less likely that a smart contract will become completely unavailable or be taken offline.
Ethereum nodes located around the world as of August 5, 2020.10
Deploying and Updating Models
Hosting a model on a public blockchain requires an initial one-time fee for deployment, usually a few dollars for the public Ethereum chain, based on the computational cost to the blockchain network.11 From that point, anyone contributing data to train the model, whether the individual who deployed it or another participant, will have to pay a small fee, usually a few cents, again proportional to the amount of computation being done.
Using our framework, we set up a Perceptron12 model capable of classifying the sentiment, positive or negative, of a movie review. As of August 2020, it costs about USD $0.40 to update the model on Ethereum. We have plans to extend our framework so that most data contributors will not have to pay this fee. For example, contributors could be reimbursed during a reward stage, or a third party could submit the data and pay the fee on their behalf when the data comes from usage of the third party’s technology, such as a game.
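The fee figures above follow the usual Ethereum cost arithmetic: the gas consumed by a transaction, times the gas price, converted from ETH to dollars. The sketch below only illustrates that arithmetic. The function name and the three input values are hypothetical round numbers chosen for the example, not measurements from our deployment.

```python
GWEI_PER_ETH = 1_000_000_000  # 1 ETH = 10^9 gwei


def update_cost_usd(gas_used: int, gas_price_gwei: float, eth_price_usd: float) -> float:
    """Estimate the dollar cost of one transaction.

    fee in ETH = gas used * gas price (gwei) / 10^9
    fee in USD = fee in ETH * ETH price in USD
    """
    fee_eth = gas_used * gas_price_gwei / GWEI_PER_ETH
    return fee_eth * eth_price_usd


# Hypothetical inputs (assumptions for illustration only):
# 50,000 gas, 20 gwei per unit of gas, and 400 USD per ETH.
cost = update_cost_usd(gas_used=50_000, gas_price_gwei=20, eth_price_usd=400)
print(f"{cost:.2f} USD")
```

Because the gas used grows with the amount of computation, simpler model updates translate directly into lower fees.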
To reduce computational costs, we use models that are very efficient to train, such as a Perceptron or a Nearest Centroid Classifier. We can also use these models along with high-dimensional representations computed off-chain.13 More complicated models could be integrated using API calls from the smart contract to machine learning services, but ideally, for transparency and persistence, models would be kept completely public in a smart contract.
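To show why a Perceptron is cheap to update, here is a minimal Python sketch over sparse bag-of-words features. It is an illustration, not the smart contract code from our implementation: the class name, the integer weights, and the feature indices are assumptions made for the example. A single update touches only the weights of the features present in the sample, and only when the current prediction is wrong.

```python
from typing import Dict, Iterable


class SparsePerceptron:
    """Binary perceptron over sparse feature indices (for example, word IDs)."""

    def __init__(self, learning_rate: int = 1) -> None:
        self.weights: Dict[int, int] = {}
        self.bias = 0
        self.learning_rate = learning_rate

    def predict(self, features: Iterable[int]) -> int:
        # Sum the weights of the features that are present; absent features cost nothing.
        score = self.bias + sum(self.weights.get(i, 0) for i in features)
        return 1 if score > 0 else 0

    def update(self, features: Iterable[int], label: int) -> None:
        features = list(features)
        if self.predict(features) == label:
            return  # Correct prediction: the classic perceptron rule changes nothing.
        # Move the weights toward the true label (1 = positive, 0 = negative).
        step = self.learning_rate if label == 1 else -self.learning_rate
        for i in features:
            self.weights[i] = self.weights.get(i, 0) + step
        self.bias += step


# Hypothetical word IDs: 0 and 3 appear in a positive review, 1 and 2 in a negative one.
model = SparsePerceptron()
model.update([0, 3], label=1)
model.update([1, 2], label=0)
print(model.predict([0, 3]), model.predict([1, 2]))
```

The same `predict` logic can run locally on a device at no cost, while `update` corresponds to the on-chain step that incurs a transaction fee.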
Adding data to a model in the Decentralized & Collaborative AI on Blockchain framework consists of three steps: (1) The incentive mechanism, designed to encourage the contribution of “good” data, validates the transaction—for instance, requiring a “stake” or monetary deposit; (2) the data handler stores data and metadata onto the blockchain; (3) the machine learning model is updated.
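The three steps above can be sketched in Python as follows. This is a simplified illustration rather than the contracts in our implementation: all class and method names here are hypothetical, the storage is an in-memory list standing in for the blockchain, and the model is a deliberately trivial label counter.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StakeIncentive:
    """Step 1: accept a contribution only if it carries a large enough deposit."""
    min_deposit: int

    def validate(self, deposit: int) -> None:
        if deposit < self.min_deposit:
            raise ValueError("deposit is below the required stake")


@dataclass
class DataHandler:
    """Step 2: keep the data and its metadata (stand-in for on-chain storage)."""
    records: List[dict] = field(default_factory=list)

    def store(self, sender: str, features: List[int], label: int,
              deposit: int, timestamp: int) -> None:
        self.records.append({"sender": sender, "features": features, "label": label,
                             "deposit": deposit, "timestamp": timestamp})


@dataclass
class LabelCounter:
    """Step 3: a trivial placeholder model that predicts the most common label."""
    counts: Dict[int, int] = field(default_factory=dict)

    def update(self, features: List[int], label: int) -> None:
        self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, features: List[int]) -> int:
        return max(self.counts, key=self.counts.get) if self.counts else 0


@dataclass
class CollaborativeTrainer:
    incentive: StakeIncentive
    handler: DataHandler
    model: LabelCounter

    def add_data(self, sender: str, features: List[int], label: int,
                 deposit: int, timestamp: int) -> None:
        self.incentive.validate(deposit)                                 # (1) incentive mechanism
        self.handler.store(sender, features, label, deposit, timestamp)  # (2) data handler
        self.model.update(features, label)                               # (3) model update


trainer = CollaborativeTrainer(StakeIncentive(min_deposit=5), DataHandler(), LabelCounter())
trainer.add_data("alice", [1, 2], label=1, deposit=5, timestamp=100)
```

Because validation runs first, a contribution that fails the incentive check is neither stored nor used to update the model.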
Blockchains easily let us share evolving model parameters. Newly created information such as new words, new movie titles, and new pictures can be used to update existing hosted models, regardless of any specific person’s or organization’s ability to update and host the model themselves. To encourage people to contribute new data that will help maintain the model’s performance, we propose several incentive mechanisms: gamified, prediction market-based, and ongoing self-assessment.14
Gamified: Like on Stack Exchange15 sites, data contributors can earn points and badges when other contributors validate their contributions. This proposal relies solely on the willingness of contributors to collaborate for a common good—the betterment of the model. This is our baseline proposal since it should be easy to implement and understand, as no assets of any value are exchanged.
Prediction market–based: Contributors are rewarded if their contributions improve the performance of the model when evaluated using a specific test set. This proposal builds on existing work using prediction market16 frameworks to collaboratively train and evaluate models, including “A Collaborative Mechanism for Crowdsourcing Prediction Problems”17 and “A Market Framework for Eliciting Private Data.”18
The prediction market-based incentive in our framework has three phases:
A commitment phase in which a provider stakes a bounty to be awarded to contributors and shares enough of the test set to prove the test set is valid;19
A participation phase in which participants submit training data samples with a small deposit of funds to cover the possibility that their data is incorrect;
A reward phase in which the provider reveals the rest of the test set and a smart contract validates that it matches the proof provided in the commitment phase.
Participants are rewarded based on how much their contributions helped the model improve. If the model did worse on the test set, then participants who contributed “bad” data lose their deposits.
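The commitment and reward phases rely on a commit-and-reveal pattern: the provider first publishes a fingerprint of the test set, and the contract later checks that the revealed data matches it. The Python sketch below illustrates the idea under simplifying assumptions. It commits to the whole test set with one salted SHA-256 hash, whereas a real deployment would reveal part of the test set up front and would typically use the hash function native to the blockchain. The function names and the JSON serialization are hypothetical choices for this example.

```python
import hashlib
import hmac
import json
from typing import List, Tuple

TestSet = List[Tuple[List[int], int]]  # (feature indices, label) pairs


def commit(test_set: TestSet, salt: bytes) -> str:
    """Commitment phase: publish only this digest, not the test set itself."""
    payload = json.dumps(test_set, sort_keys=True).encode("utf-8")
    return hashlib.sha256(salt + payload).hexdigest()


def verify_reveal(commitment: str, revealed: TestSet, salt: bytes) -> bool:
    """Reward phase: check that the revealed test set matches the earlier commitment."""
    return hmac.compare_digest(commit(revealed, salt), commitment)


# Hypothetical data: two labeled samples and a secret salt kept by the provider.
test_set = [([0, 3], 1), ([1, 2], 0)]
salt = b"provider-secret-salt"
commitment = commit(test_set, salt)

tampered = [([0, 3], 1), ([1, 2], 1)]  # one label flipped after the fact
print(verify_reveal(commitment, test_set, salt))
print(verify_reveal(commitment, tampered, salt))
```

The salt prevents participants from guessing the test set by hashing candidate samples, and the check fails if the provider changes any sample or label between the two phases.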
Here is how the process looks when running in a simulation:
For this simulation, a Perceptron was trained on the IMDB reviews dataset for sentiment classification.20 The participants contributing “good” data profit, the participants contributing “bad” data lose their funds, and the model’s accuracy improves. This simulation represents the addition of about 8,000 training samples.
Ongoing self-assessment: Participants effectively validate and pay each other for good data contributions. In such scenarios, an existing model already trained with some data is deployed. A contributor wishing to update the model submits data with features x, label y, and a deposit. After some predetermined time has passed,21 if the current model still agrees with the classification, then the contributor’s deposit is returned. We now assume the data has been validated as “good,” and that contributor earns a point. If a contributor adds “bad” data—that is, data that cannot be validated as “good”—then the contributor’s deposit is forfeited and split among contributors who have earned points for “good” contributions. Such a reward system would help deter the malicious contribution of “bad” data.22
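The deposit and refund rule described above can be sketched in Python as follows. This is a simplified illustration, not our contract code: the class names are hypothetical, time is a plain integer, and splitting a forfeited deposit in proportion to earned points is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Contribution:
    contributor: str
    features: List[int]
    label: int
    deposit: int
    submitted_at: int
    settled: bool = False


class SelfAssessment:
    def __init__(self, predict: Callable[[List[int]], int], wait_time: int) -> None:
        self.predict = predict      # the current model's prediction function
        self.wait_time = wait_time  # the predetermined waiting period
        self.points: Dict[str, int] = {}
        self.payouts: Dict[str, int] = {}
        self.unclaimed = 0          # forfeited funds nobody was eligible to receive

    def _pay(self, who: str, amount: int) -> None:
        self.payouts[who] = self.payouts.get(who, 0) + amount

    def settle(self, c: Contribution, now: int) -> bool:
        """Return True if the data was validated as good, False if the deposit was forfeited."""
        if c.settled:
            raise ValueError("contribution already settled")
        if now - c.submitted_at < self.wait_time:
            raise ValueError("waiting period has not elapsed")
        c.settled = True
        if self.predict(c.features) == c.label:
            # The model still agrees: refund the deposit and award a point.
            self._pay(c.contributor, c.deposit)
            self.points[c.contributor] = self.points.get(c.contributor, 0) + 1
            return True
        # The model disagrees: split the deposit among contributors with points.
        total_points = sum(self.points.values())
        paid = 0
        for who, pts in self.points.items():
            share = c.deposit * pts // total_points
            self._pay(who, share)
            paid += share
        self.unclaimed += c.deposit - paid  # rounding remainder, or everything if no one has points
        return False


# Hypothetical stand-in model: positive when the feature values sum to more than zero.
game = SelfAssessment(predict=lambda f: 1 if sum(f) > 0 else 0, wait_time=5)
good = Contribution("alice", [1], label=1, deposit=10, submitted_at=0)
bad = Contribution("mallory", [1], label=0, deposit=9, submitted_at=0)
game.settle(good, now=5)
game.settle(bad, now=5)
print(game.payouts, game.points)
```

Since only contributors with validated data share in forfeited deposits, submitting incorrect data costs the submitter money while rewarding those who contributed correctly.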
For this simulation, a Perceptron was again trained on the IMDB reviews dataset for sentiment classification.23 This figure demonstrates balance percentages and model accuracy in a simulation where an adversarial “bad contributor” is willing to spend more than an honest “good contributor.” Despite the malicious efforts, the accuracy can still be maintained, and the honest contributor profits. This simulates the addition of about 25,000 training samples.
From the Small and Efficient to the Complex
The Decentralized & Collaborative AI on Blockchain framework is about sharing models, making valuable resources more accessible to all, and—just as importantly—creating large public datasets that can be used to train models inside and outside of the blockchain environment. We have addressed some common questions about our project at https://aka.ms/0xDeCA10B-FAQ.
Currently, this framework is mainly designed for small models that can be efficiently updated. As blockchain technology advances, we anticipate that more applications for collaboration between people and machine learning models will become available, and we hope to see future research in scaling to more complex models along with new incentive mechanisms.