Code for running the GNN is available on GitHub as part of the survey paper by
Marcelli et al., *How Machine Learning Is Solving the Binary Function Similarity Problem*.
The code is located at https://github.com/Cisco-Talos/binary_function_similarity/tree/main/Models/GGSNN-GMN
and comes with a descriptive README. Our code and documentation are intended to complement, not
replace, that work. We therefore provide and discuss only our supplemental code for extending
the GNN to return the function embeddings it produces. Readers should familiarize themselves
with the original repository before running our experiments.

**Note: running this code requires a copy of IDA Pro.**

# Using the GNN to generate function embeddings

0. Clone the Marcelli GitHub repository: *git clone https://github.com/Cisco-Talos/binary_function_similarity.git*

1. Integrate our supplemental code:
	
	1. Comment out lines #157 and #158 in *binary_function_similarity/IDA_scripts/IDA_flowchart/IDA_flowchart.py*.
       Unlike Marcelli et al., we do not skip functions with fewer than 5 basic blocks.
       	   
    2. Replace *binary_function_similarity/IDA_scripts/generate_idbs.py* with 
       *refuse-public/data-processing/gnn/generate_idbs.py*, which includes support for 
       the MOTIF, CommonLibraries, and BinaryCorp datasets. Update the paths to those dataset files in
       lines 181 and 186. 
       	          
    3. Replace *binary_function_similarity/Models/GGSNN-GMN/NeuralNetwork/gnn.py* with our custom
       *gnn.py* file. Our file adds support for using the GNN to generate and save embeddings,
       without any downstream evaluation. It also adds support for testing on our new datasets.
       
    4. Replace *config.py*, *build_model.py*, and *gnn_model.py* in 
	   *binary_function_similarity/Models/GGSNN-GMN/NeuralNetwork/core* with our custom files.
       *build_model.py* and *gnn_model.py* add support for returning and saving the embeddings
       generated by the model. *config.py* adds new configuration options to support testing
       on our additional datasets.
       
    5. Add *graph_factory_embeddings.py* to *binary_function_similarity/Models/GGSNN-GMN/NeuralNetwork/core*.
       This file contains a custom dataloader that passes each function to the model once, to 
       generate an embedding. This is in contrast to the pair and triplet loaders provided in the 
       original repository.
    
2. Preprocess the data using the code provided by Marcelli, after making the modifications
   indicated by Steps 1.1 and 1.2. **This will require IDA Pro**. Refer to the Marcelli 
   README at https://github.com/Cisco-Talos/binary_function_similarity/tree/main/IDA_scripts.
   Our instructions below supplement, but do not replace, the ones included there. 
      
	1. Follow the Marcelli README instructions to set up the needed requirements.
				
	2. Set the appropriate file paths in *generate_idbs.py*. Then run
	   *python3 generate_idbs.py --dataset*, where "dataset" is one of "motif",
	   "commonlibraries", or "binarycorp".
	   When possible, we recommend making the *.pdb* files for a dataset
	   available to IDA. This allows for more accurate analysis.
	   
	3. Follow the Marcelli README instructions to run the IDA Flowchart plugin over the IDBs.
	
	4. Run *python3 refuse-public/data-processing/gnn/Dataset-dataset creation.py*, where
	   "dataset" is "MOTIF", "CommonLibraries", or "BinaryCorp". First, modify line #44 (output directory),
	   line #48 (path to *.csv* file generated in Step 2.3), and, for MOTIF only, line #49 
	   (path to *motif_dataset.jsonl* file, which can be downloaded at https://github.com/boozallen/MOTIF).
	   Place the resulting *Dataset-dataset.csv* file in *binary_function_similarity/DBs/Dataset-dataset*,
	   where "dataset" is "MOTIF", "CommonLibraries", or "BinaryCorp".
	   
	5. Follow the Marcelli README instructions to run the IDA ACFG_disasm plugin over the *.json*
	   file generated in Step 2.4.
	    
	6. Navigate to *binary_function_similarity/Models/GGSNN-GMN*. Follow the instructions in
	   Part 1 of the README to run *gnn_preprocessing.py* in *--training* mode, passing as input the 
	   output from Step 2.5. Set the output directory as 
	   *binary_function_similarity/Models/GGSNN-GMN/Preprocessing/Dataset-dataset*, where
	   "dataset" is "MOTIF", "CommonLibraries", or "BinaryCorp".
	   
3. Generate embeddings using the trained GNN model released by Marcelli et al. Follow the
   instructions in Part 2 of the README in *binary_function_similarity/Models/GGSNN-GMN*. 
   After building the docker image, run the following docker command:
   
   ```
   docker run --rm \
        -v $(pwd)/../../DBs:/input \
        -v $(pwd)/Preprocessing:/preprocessing \
        -v $(pwd)/NeuralNetwork/:/output \
        -it gnn-neuralnetwork /code/gnn.py --embeddings \
            --model_type embedding --training_mode pair \
            --features_type opc --dataset dset \
            -c /code/model_checkpoint_GGSNN_pair \
            -o /output/Dataset-dset
   ```
   where "dset" is one of "motif", "commonlibraries", or "binarycorp". The resulting 
   embeddings and labels files will be stored in 
   *binary_function_similarity/Models/GGSNN-GMN/NeuralNetwork/Dataset-dset*.
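
When generating embeddings for all three datasets, it can be convenient to build the Step 3 command programmatically instead of editing it by hand. The sketch below is illustrative only and not part of either repository: the helper name `build_embedding_command` and the `model_dir` parameter are hypothetical, and the arguments simply mirror the docker command shown above. It assumes `model_dir` points at *binary_function_similarity/Models/GGSNN-GMN* and that the `gnn-neuralnetwork` image has already been built.

```python
import os
import shlex

# Dataset names accepted by --dataset, as listed in Step 3.
DATASETS = ("motif", "commonlibraries", "binarycorp")


def build_embedding_command(dset, model_dir):
    """Return the Step 3 docker command as an argument list (hypothetical helper)."""
    if dset not in DATASETS:
        raise ValueError(f"unknown dataset {dset!r}; expected one of {DATASETS}")
    model_dir = os.path.abspath(model_dir)
    # Equivalent to $(pwd)/../../DBs when run from Models/GGSNN-GMN.
    dbs_dir = os.path.normpath(os.path.join(model_dir, "..", "..", "DBs"))
    return [
        "docker", "run", "--rm",
        "-v", f"{dbs_dir}:/input",
        "-v", f"{os.path.join(model_dir, 'Preprocessing')}:/preprocessing",
        "-v", f"{os.path.join(model_dir, 'NeuralNetwork')}:/output",
        "-it", "gnn-neuralnetwork", "/code/gnn.py", "--embeddings",
        "--model_type", "embedding", "--training_mode", "pair",
        "--features_type", "opc", "--dataset", dset,
        "-c", "/code/model_checkpoint_GGSNN_pair",
        "-o", f"/output/Dataset-{dset}",
    ]


# Print one shell-ready command per dataset (nothing is executed here).
for name in DATASETS:
    print(shlex.join(build_embedding_command(name, ".")))
```

The printed commands can be pasted into a shell, or the argument list can be passed to `subprocess.run`. Note that `-it` expects an interactive terminal, so it may need to be dropped when running from a non-interactive script.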
		
**Troubleshooting** 
We have occasionally encountered the following error:
```
tensorflow.python.framework.errors_impl.InvalidArgumentError: data.shape = [1470,128] does not start with segment_ids.shape = [1271]
	 [[{{node graph-embedding-net_3/graph-aggregator/UnsortedSegmentSum}}]]
```
and have found that it can be resolved by replacing line 116 of 
*binary_function_similarity/Models/GGSNN-GMN/NeuralNetwork/core/graph_factory_utils.py* with
`graph_idx.append(np.ones(len(f), dtype=np.int32) * i)`
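
The error message indicates that the number of per-node rows (1470) differs from the number of segment ids (1271); a segment sum needs exactly one graph index per node. The following pure-Python sketch illustrates that invariant. It is not the repository's code: `build_graph_idx` and `segment_sum` are hypothetical names, and we assume that `f` in the replacement line is a per-graph array whose length equals the number of nodes in graph `i`. The list expression `[i] * len(nodes)` plays the role of `np.ones(len(f), dtype=np.int32) * i`.

```python
def build_graph_idx(graphs):
    """Return one graph index per node, in node order (hypothetical helper)."""
    graph_idx = []
    for i, nodes in enumerate(graphs):
        # One entry per node of graph i, all equal to i.
        graph_idx.extend([i] * len(nodes))
    return graph_idx


def segment_sum(node_values, graph_idx, n_graphs):
    """Sum node values per graph; lengths must match, as in the TensorFlow op."""
    if len(node_values) != len(graph_idx):
        raise ValueError(
            f"{len(node_values)} node values but {len(graph_idx)} segment ids"
        )
    totals = [0.0] * n_graphs
    for value, g in zip(node_values, graph_idx):
        totals[g] += value
    return totals


# Three toy graphs with 2, 1, and 3 nodes (scalar node features for brevity).
graphs = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
graph_idx = build_graph_idx(graphs)
flat_values = [v for nodes in graphs for v in nodes]
print(graph_idx)
print(segment_sum(flat_values, graph_idx, len(graphs)))
```

If the index list were built from a quantity other than the node count, the two lengths would diverge and the aggregation would fail in the same way as the error above.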
