zak100

Machine Learning Tool for Smart Contracts


LSTMs are a couple years away from cutting edge. This is a big ask, but let me ask: have you implemented a convolutional neural network before? An LSTM is basically a recurrent net built with four smaller networks (that’s a simplification, please don’t take it too literally): one for input, one each for short- and long-term memory, and finally an output network. As I said, this is a complex network, so if you haven’t built a CNN you need to learn how to do that before scaling up.

Also what is this for?  LSTMs tend to be best for time series analysis (stock prices, sales data, etc) so there may be a simpler network you could use for your problem.
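
For readers following along, here is a minimal sketch of a single LSTM time step in plain Python. It is meant only to show the four small learned transforms (forget gate, input gate, candidate values, output gate) that the "four networks" simplification refers to; in the standard formulation these are fully connected layers rather than convolutional ones. The sizes, the random weights and the helper names (`dense`, `lstm_step`, `random_params`) are made up for illustration and are not taken from the paper.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(weights, bias, vec):
    """One fully connected layer: weights @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a standard LSTM cell.

    params maps "forget", "input", "candidate" and "output" to a
    (weights, bias) pair that acts on the concatenation [h_prev, x].
    """
    z = h_prev + x  # list concatenation of previous hidden state and input
    f = [sigmoid(a) for a in dense(*params["forget"], z)]       # what to keep
    i = [sigmoid(a) for a in dense(*params["input"], z)]        # what to write
    g = [math.tanh(a) for a in dense(*params["candidate"], z)]  # new content
    o = [sigmoid(a) for a in dense(*params["output"], z)]       # what to expose
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def random_params(hidden, inputs, seed=0):
    """Hypothetical random weights, only so the sketch can run."""
    rng = random.Random(seed)
    def layer():
        w = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + inputs)]
             for _ in range(hidden)]
        return w, [0.0] * hidden
    return {name: layer() for name in ("forget", "input", "candidate", "output")}

params = random_params(hidden=3, inputs=2)
h, c = [0.0] * 3, [0.0] * 3
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):  # a toy 3-step input sequence
    h, c = lstm_step(x, h, c, params)
print(h)
```

The cell state `c` is the long-term memory and the hidden state `h` is the short-term one; a real project would use a library implementation instead of hand-written loops.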

12 hours ago, PoetheProgrammer said:

Also what is this for?  LSTMs tend to be best for time series analysis (stock prices, sales data, etc) so there may be a simpler network you could use for your problem.

As I understand from the paper attached to the opening post, the data that the NN works with is a sequence, and the LSTM model was tested and was successful compared to some other approaches for that data. But why @zak100 has selected this specific paper and approach as a starting point is unclear.

Quote

Sequential Modeling of Smart Contracts

... opcode sequence processed by the LSTM model, followed by the usage of smart contract opcode sequence as input for our learning model to detect security threats, Figure 2.

But the authors also acknowledge limitations that may be addressed in future work. One example is a limitation regarding loops and function calls in the contracts. Personally I do not have enough experience with this topic to recommend a solution for the issues raised in the paper.
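
To make "opcode sequence as input" concrete, here is a small sketch of how opcode sequences could be turned into fixed-length integer sequences, which is the usual input format for a sequence model. This is an assumption about a typical preprocessing step, not the paper's exact pipeline; the function names, the padding id 0 and the toy opcode lists are made up for illustration.

```python
def build_vocab(sequences):
    """Assign an integer id to every opcode; id 0 is reserved for padding."""
    vocab = {"<PAD>": 0}
    for seq in sequences:
        for op in seq:
            vocab.setdefault(op, len(vocab))
    return vocab

def encode(seq, vocab, max_len):
    """Map opcodes to ids, then truncate or right-pad to max_len."""
    ids = [vocab[op] for op in seq][:max_len]
    return ids + [0] * (max_len - len(ids))

# Hypothetical toy "contracts", each already disassembled to opcode names.
contracts = [
    ["PUSH1", "PUSH1", "MSTORE", "CALLVALUE", "DUP1", "ISZERO"],
    ["PUSH1", "SLOAD", "CALLER", "SELFDESTRUCT"],
]
vocab = build_vocab(contracts)
batch = [encode(seq, vocab, max_len=5) for seq in contracts]
print(batch)
```

Every contract ends up as a list of the same length, so a batch can be fed to an embedding layer followed by a recurrent layer.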

 

Edited by Ghideon

1 hour ago, zak100 said:

Please tell me a simple ML network for this problem

What is an "ML network"? Maybe you mean neural network? Some comments regarding your request:

1: Given the conclusions of the paper, what indications do you have that a simple* solution will be effective to solve the problem**? From the paper:

Quote

In a similar fashion, Shen et al. [46] showcased the importance of using the sequence memory architecture in recurrent neural networks for the task of predicting security events in a computer. With the increased complexity of malicious activities in computer systems, simple methods such as the Markov Chains [42] or 3-gram models [11] are no longer effective in predicting these malicious events. By leveraging on the long-term memory typical of LSTM models, it has been shown that their system was able to predict future events of a machine, specific steps that an attacker would undertake, based on previous observations.

2: Do you have the data required for training the model? 

 

1 hour ago, zak100 said:

Can we do it using Python?

As far as I know, there are a number of suitable libraries and frameworks for Python.

 

*) I do not consider the NN architecture in the paper to be "simple" but simple is relative and this may be due to my lack of experience in this specific topic.
**) I have to assume the problem is the same as discussed in the paper, not some other undisclosed thing


Hi,

Ghideon: I am referring to the comment of PoetheProgrammer "so there may be a simpler network you could use for your problem."

 

That's why I am asking what he means by a simpler network.

<2: Do you have the data required for training the model? >

He used 0.6 million contracts. I don't know how to download them, but I can create a dataset of 20 to 30 contracts; that would be my next question. First we have to settle on what machine learning (ML) network we would be using. I am saying ML because I want to concentrate on middle layers. We need more hidden layers.

 

So kindly guide me on which package to use in Python for creating a simpler network that has more hidden layers.

 

Zulfi.


Just so that we do not waste time: you are looking at the problem in the paper, "usage of smart contract opcode sequence as input for our learning model to detect security threats". Ok? That specific problem, not some other variant, Ok?

 

50 minutes ago, zak100 said:

I am referring to the comment of PoetheProgrammer "so there may be a simpler network you could use for your problem."

Did you look at my follow-up to that, in the context of the paper you provided? I do not claim there is no simpler solution to the problem described in the paper. I say that the authors of the paper seem to have concluded that, for the specific problem they worked on, simpler* solutions did not perform well. @PoetheProgrammer's response** is perfectly fine in a more general context, and the difference may be the result of the particular approach the researchers in the paper used. If you extract other features from the data than they did in the paper, for instance some feature that is not a sequence, then other NN models or architectures may very well be used. Some of those could be simpler but could have other drawbacks.

50 minutes ago, zak100 said:

First we have to settle on what machine learning (ML) network we would be using.

I would not do that. If there is insufficient data to train and test the model then the project is going to fail, so there is no need to waste time. Unless there is some alternative model that would solve the problem with the limited amount of data you have.

 

50 minutes ago, zak100 said:

I am saying ML because I want to concentrate on middle layers. We need more hidden layers.

Why? More hidden layers than what?

 

50 minutes ago, zak100 said:

So kindly guide me on which package to use in Python for creating a simpler network that has more hidden layers.

Simpler than what? More hidden layers than what? 

 

*) Again; "simpler" is relative. I use the word as it is used in the paper; a Markov chain or 3-gram model is simpler than an LSTM model.
**) From what I've seen so far PoetheProgrammer has more knowledge than me in these topics. If my reply (unintentionally) contradicts them I would give Poethe's response more weight.
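
To illustrate what "simpler" means here, the following is a minimal sketch of a 3-gram model over opcode sequences: it only counts which opcode tends to follow each pair of consecutive opcodes. It is a generic textbook-style example under made-up toy data, not the baseline implementation cited in the paper, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_trigrams(sequences):
    """Count which opcode follows each pair of consecutive opcodes."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            counts[(a, b)][c] += 1
    return counts

def predict_next(counts, a, b):
    """Most frequent opcode seen after (a, b), or None if the pair is unseen."""
    followers = counts.get((a, b))
    return followers.most_common(1)[0][0] if followers else None

# Hypothetical toy opcode sequences.
toy = [
    ["PUSH1", "PUSH1", "MSTORE", "CALLVALUE"],
    ["PUSH1", "PUSH1", "MSTORE", "PUSH1"],
    ["PUSH1", "PUSH1", "MSTORE", "CALLVALUE"],
]
model = train_trigrams(toy)
print(predict_next(model, "PUSH1", "MSTORE"))
print(predict_next(model, "CALLER", "SELFDESTRUCT"))
```

Such a model only ever looks two opcodes back, which is the kind of short memory that the quoted passage contrasts with the long-term memory of an LSTM.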

Edited by Ghideon


Yes, sorry if I wasn’t clear initially. You asked for guidance on building this LSTM, but that’s a big ask, and I was advising you to implement smaller networks on your own to build the knowledge and intuition required to implement this paper. An LSTM is four networks, so building a CNN is definitely a first step regardless. In the process you may find a simpler network (or a combination of them) will suffice for the problem you are trying to solve.

In general if you are wanting help implementing something it needs to be smaller than a research paper.  At least take a first pass at it and come back when you run into trouble or get stumped.  As it is it seems like asking someone to do quite a bit of work for you instead of helping.


Hi,

<I would not do that. If there is insufficient data to train and test the model then the project is going to fail>

You mean I should have the same number of contracts that he (the author) used? I mean, do I have to use around 1 million contracts for ML to work properly?

 

Zulfi.

6 hours ago, zak100 said:

You mean I should have the same number of contracts that he (the author) used?

No.

6 hours ago, zak100 said:

I mean, do I have to use around 1 million contracts for ML to work properly?

No. I just stated that with insufficient data the project will fail regardless of NN architecture. I do not know how much you have studied the paper you provided or your level of knowledge about data processing, so the following example may be obvious:

On 12/11/2020 at 2:16 PM, zak100 said:

I don't know how to download them, but I can create a dataset of 20 to 30 contracts

Ok. And the paper states about the contracts:

Quote

While we collected a moderately large training set, it was highly imbalanced. It is an issue with classification problems where the classes are not represented equally, and one class outnumbers the other classes by a large proportion. Based on the distribution of the original dataset, 99.03% of the contracts are labeled as not- vulnerable by Maian, while only 0.97% of contracts are either greedy, suicidal, and/or prodigal.

So for a sample of 20-30 contracts there will on average be fewer than one vulnerable contract. Very simplified, we can say that if you train the model on such a sample, where all objects are of one class (non-vulnerable), you will end up with a model that classifies everything as belonging to that single class (non-vulnerable).

The above example is simplified and intentionally naive; imbalanced data is common in machine learning and not specific to the data in the paper.
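
The arithmetic behind this can be checked in a few lines. The sketch below assumes that the 20 to 30 contracts are drawn at random from a population with the 0.97% vulnerable rate quoted above; hand-written contracts would of course not follow that distribution, so treat it only as an illustration.

```python
vulnerable_rate = 0.0097  # 0.97% flagged as vulnerable in the quoted passage

results = {}
for n in (20, 30):
    expected = n * vulnerable_rate         # average number of vulnerable contracts
    p_none = (1 - vulnerable_rate) ** n    # chance the sample has none at all
    results[n] = (expected, p_none)
    print(f"n={n}: expected vulnerable = {expected:.2f}, "
          f"P(no vulnerable contract) = {p_none:.2f}")
```

Under that assumption the expected count is roughly 0.2 to 0.3 contracts, and about three samples out of four would contain no vulnerable contract at all.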

 

Edited by Ghideon


Hi,

Kindly tell me what CNN stands for and which Python package can be used for creating this network. Please provide a Python link for it.

 

Zulfi.

37 minutes ago, zak100 said:

Kindly tell me what CNN stands for

Convolutional Neural Network.

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. (https://en.wikipedia.org/wiki/Convolutional_neural_network)

A glossary with machine learning terms: https://developers.google.com/machine-learning/glossary


Hi,

Thanks.

Can we implement a CNN in Python? Is there a good Python link for ML (machine learning)?

 

Zulfi.

3 minutes ago, zak100 said:

Can we implement a CNN in Python?

Yes. But a CNN may not be applicable to the problem you have posted.

 


Hi,

Kindly suggest some other deep learning technique for my example. Also, kindly provide a link to a Python implementation.

 

Zulfi.

3 minutes ago, zak100 said:

Kindly suggest some other deep learning technique for my example.

Just so that we do not waste time: you are still looking at the problem in the paper, "usage of smart contract opcode sequence as input for our learning model to detect security threats". That specific problem, not some other variant, Ok?

Edited by Ghideon


Hi,

Yes, an ML network only for smart contract (SC) related vulnerability detection.

Zulfi.

4 minutes ago, zak100 said:

Yes, an ML network only for smart contract (SC) related vulnerability detection.

That is a wider description than the one in the attached paper. I stress this because any hint I may give regarding an approach depends on what type of problem you are trying to solve. Are you using opcode sequences as in the paper?


Hi,

Yes. I want to follow the same approach. But I need something which can be implemented in Python.

Zulfi.

8 hours ago, zak100 said:

Yes. I want to follow the same approach. But I need something which can be implemented in Python.

As far as I can see, the steps described in the paper can be implemented in Python. But to know whether you should use Python, and which parts of the project may benefit from other tools, a more thorough investigation would be required before I could make a statement. It may also depend on how much you intend to implement. Example: I note that the data is preprocessed using the MAIAN tool (outlined in the paper) and that MAIAN depends on the Solidity compiler.

Quote

Hence, we use a high-level language— Solidity to write smart contracts effectively. In order to deploy a smart contract, we compile the solidity code using a compiler, and it will translate our source code into bytecode.

Those tools may or may not be using Python, but to what extent that affects you is for you to investigate.
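
As an illustration of the step from compiled bytecode to an opcode sequence, here is a small sketch of a disassembler in Python. The opcode table is deliberately partial (only a handful of well-known EVM opcodes), unknown bytes are just labeled as such, and the example bytecode string is a short hand-made fragment; a real project would rely on an existing disassembler that covers the full instruction set.

```python
# Partial EVM opcode table; a real tool needs the complete instruction set.
OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x15: "ISZERO", 0x34: "CALLVALUE",
    0x50: "POP", 0x52: "MSTORE", 0x54: "SLOAD", 0x55: "SSTORE",
    0x56: "JUMP", 0x57: "JUMPI", 0x5B: "JUMPDEST", 0x80: "DUP1",
    0xF3: "RETURN", 0xFD: "REVERT", 0xFF: "SELFDESTRUCT",
}

def disassemble(bytecode_hex):
    """Turn a hex bytecode string into a list of opcode names."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    ops, i = [], 0
    while i < len(code):
        byte = code[i]
        if 0x60 <= byte <= 0x7F:       # PUSH1..PUSH32 carry 1..32 data bytes
            n = byte - 0x5F
            ops.append(f"PUSH{n}")
            i += n                      # skip the pushed data, keep only the opcode
        else:
            ops.append(OPCODES.get(byte, f"UNKNOWN_0x{byte:02x}"))
        i += 1
    return ops

# Hand-made fragment for illustration, not a complete contract.
ops = disassemble("0x6080604052348015600f57600080fd5b00")
print(ops)
```

Dropping the pushed data and keeping only the opcode names is one possible design choice for building a sequence vocabulary; whether the paper does exactly this is not something this sketch claims.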

 

How to implement it and which libraries to use is another matter, depending on details not yet discussed. I could of course provide you with a list of popular Python frameworks as of 2020; there are numerous such lists available in any search engine you may want to use. But then you would still be left with sorting out which combination of the frameworks suits your needs. The task you are asking about is probably not going to be solved in a single framework or library. Personally I postpone such decisions (in commercial projects) until later in a project. In other cases specific requirements may tell me what to do; for instance, if I am consulting for a team that already runs everything on .NET I would evaluate Azure products first.

 


Hello zak100,

I think I can help you with the implementation of statistical models for what you want, for "Smart Contract (SC) related vulnerability detection". However, before I do this, I am going to need you to describe more about smart contract vulnerability detection. What exactly is this?

A machine learning pipeline, or more generally, a modeling pipeline, begins with you assembling or locating a dataset that captures the information you want to use. So, do you know of any datasets with the vulnerability levels of smart contracts quantified? Next, you'd proceed by implementing or employing a statistical model. In this case, if the vulnerability levels are on a scale of 1-10 or something like this, we'd use a multi-class classification model, and if they are continuous, meaning some float value like 12.3 or 69.87, we'd use a regression model. In the case that the data are a time series, we then might employ an LSTM. Remember that neural networks are not always necessary and may even be suboptimal if we don't have that much data. Once we have our model, we can adjust and fine-tune it to produce the best results while not overfitting, which is when the model learns the dataset it was given too well and generalizes poorly, or phrased differently, performs poorly on new data.
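
Overfitting can only be noticed if some data is held back from training. As a rough sketch of that first pipeline step, the following plain-Python function splits a labeled dataset into training and test parts while keeping each class represented in both, which matters when one class is rare. The data, the labels and the function name are hypothetical; libraries such as scikit-learn provide ready-made equivalents.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Split so each class keeps roughly the same share in train and test."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    rng = random.Random(seed)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        # At least one test item per class; a class with a single
        # sample therefore ends up in the test set only.
        n_test = max(1, round(len(items) * test_fraction))
        test += [(s, label) for s in items[:n_test]]
        train += [(s, label) for s in items[n_test:]]
    return train, test

# Hypothetical toy data: 95 "safe" contracts and 5 "vulnerable" ones.
samples = [f"contract_{k}" for k in range(100)]
labels = ["safe"] * 95 + ["vulnerable"] * 5
train, test = stratified_split(samples, labels)
print(len(train), len(test))
```

A model is then fitted on `train` only, and its score on `test` gives an estimate of how well it generalizes.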

Using Google search, I have come across this website: https://smartbugs.github.io/

It lists several smart contract datasets, although I am uncertain if these are the type you are searching for. There are also research papers that came up, such as https://www.ijcai.org/Proceedings/2020/0454.pdf and https://alfagroup.csail.mit.edu/sites/default/files/documents/2020. Exploring Deep Learning Models for Vulnerabilities Detection in Smart Contracts.NLeSimple-Master_Thesis.pdf . These might have code that you can use. If you search on https://paperswithcode.com/ for "smart contract vulnerabilities" I am hopeful that a few papers will come up. Perhaps you can clone their repositories and play around with their code. I mean, if what you are looking to do already exists, then there is little more efficient than simply using the existing implementation, unless you are trying to reinvent the wheel, which does not seem to be the case here.

A good resource for learning about CNNs from a mathematical standpoint is https://cs.nju.edu.cn/wujx/paper/CNN.pdf and for LSTMs is https://colah.github.io/posts/2015-08-Understanding-LSTMs/ . I do admit that I have not yet read these in full and absorbed their significance.

Scikit-Learn and Keras would probably be good for you for ML implementation in Python.

Also, how do I write a code block on this website?

Once I learn the answer to the above question I will write a simple Keras model in Python for you to see.

Edited by The Mule


Hi my friend,

Thanks. Very good reply. A repository is what I am looking for. But I think we must have more SCs (smart contracts). As @Ghideon has pointed out, we need more data. I think one hundred thousand would be enough.

<However, before I do this, I am going to need you to describe more about smart contract vulnerability detection. What exactly is this?>

 

SCs are programs which run on the blockchain. They are used for the transfer of cryptocurrency. However, they are not faster than credit cards because they are immutable. So for confirming transactions, one has to wait. Because of their programming nature, they can have coding flaws. Bad people can take advantage of this and execute the SC in such a way that the money, known as Ether, lands in their SC accounts. So before uploading an SC to the blockchain one has to make sure that it does not have flaws, so that an attacker can't exploit it. My objective is to create an ML (machine learning) tool to find out whether an SC has a flaw or not.

For vulnerability detection we have some coding best practices, and we have to check whether the SC follows those best practices or not. If not, then the SC is dangerous to use, so we won't upload it to the blockchain.

I don't know how I can contact you, maybe through private messages. Certainly I need somebody to help because I am not conversant with ML. I also need help in running SC tools. I don't know if Maian is in a working state or not. I have tried several SC tools, but only Remix and two other online tools (Securify and Smartcheck) are good. The rest do not work. I appreciate your offer and hope to take advantage of it.

 

God bless you for providing me all this information.

 

Zulfi.

 

 

 

 

9 hours ago, zak100 said:

As @Ghideon has pointed out, we need more data. I think one hundred thousand would be enough.

Note that my comment regarding data was in the context of the paper you have in the opening post. I did a very quick check of the papers provided by @The Mule and I note that they use other approaches, which may have an impact on data requirements during training. I'll try to find some time to have a better look at the papers. By the way, do you have some argument for why one hundred thousand would be enough?

 

12 hours ago, The Mule said:

I have come across this website:

Good! I note that they provide a curated dataset with several types of vulnerabilities. That gives Zak100 some alternatives to the data referenced in the paper in OP. 

 

12 hours ago, The Mule said:

Also, how do I write a code block on this website?

Try the button labeled <> in the menu above the post you are creating.

Example:

<example>XML Code</example>

And welcome and thanks for contributing to the discussion @The Mule.

 


Hi Ghideon,

It was just a guess but I think quantity-wise we have to keep moving up.

 

Do you think that less than 100 thousand is enough?

Zulfi. 


 

49 minutes ago, zak100 said:

It was just a guess but I think quantity-wise we have to keep moving up.

Ok! I was curious if you had some facts to back up the statement or if it was a guess.

53 minutes ago, zak100 said:

Do you think that less than 100 thousand is enough?

I do not know; I have not studied this enough to provide an opinion. As for the paper in the OP, they obtained labels for smart contracts by running them through the Maian tool. To what extent that is tied to the number of contracts, and what it may imply for other methods, I have not investigated.

