From Machine Learning

Non-deterministic Vulnerability Detection Benchmark System [P]

I work in firmware adjacent to AI, so I'm not exactly an ML guy, which is why I've come here. At work we got a bit concerned about Mythos, and all the hype made me explore some benchmarking work. I now have a pretty cool benchmark that's about 80% done sitting around, and I haven't had the time to polish it up and show it off.

I was hoping some more AI-focused people could check it out and tell me whether it's duplicate work or whether it's worth putting some time into and finishing. I'd also be happy to get some help.

The rundown is that the code is Juliet code that's been "hidden" to look somewhat like a real codebase, removing an LLM's natural advantage when viewing known CWEs while preserving the "ground truth" associated with Juliet. I also used an LLM to inject comments into the code that are accurate, misleading, or neutral, which lets the user examine how comments and plain-English text can manipulate an LLM's ability to identify a CWE.
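To make the comment experiment concrete, here is a minimal sketch of how results could be scored against the Juliet ground truth, split by comment style. This is not the project's actual harness: the `Trial` record, the `detection_rate_by_style` helper, the case IDs, and the sample predictions are all hypothetical placeholders, not real benchmark results.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trial:
    """One model answer for one disguised test case (hypothetical record layout)."""
    case_id: str
    true_cwe: str                 # ground-truth label carried over from Juliet
    comment_style: str            # "accurate", "misleading", or "neutral"
    predicted_cwe: Optional[str]  # None if the model reported no vulnerability


def detection_rate_by_style(trials):
    """Return the fraction of exact CWE matches for each comment style."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for trial in trials:
        totals[trial.comment_style] += 1
        if trial.predicted_cwe == trial.true_cwe:
            hits[trial.comment_style] += 1
    return {style: hits[style] / totals[style] for style in totals}


# Made-up placeholder data, only to show the shape of the input.
trials = [
    Trial("case-001", "CWE-121", "accurate", "CWE-121"),
    Trial("case-001", "CWE-121", "misleading", "CWE-190"),
    Trial("case-001", "CWE-121", "neutral", "CWE-121"),
    Trial("case-002", "CWE-416", "accurate", "CWE-416"),
    Trial("case-002", "CWE-416", "misleading", None),
    Trial("case-002", "CWE-416", "neutral", None),
]

rates = detection_rate_by_style(trials)
for style, rate in sorted(rates.items()):
    print(f"{style}: {rate:.2f}")
```

Since LLM output is non-deterministic, the same case and comment style could simply be appended as several `Trial` rows, one per run, and the rate then averages over the repeats.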

There are a couple hundred CWEs, generally with enough code to fill up the input context. The work that still needs to be done is around presentation, actual benchmarking of published LLMs, and possibly pruning a couple of CWEs that certain LLMs might still occasionally recognize as Juliet code.
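One rough way to find pruning candidates is to scan model responses for signs that the model recognized the code as Juliet. The sketch below is an assumption about how that could be done, not part of the project: the marker list (the word "Juliet", "SARD", and function or file naming patterns that I believe appear in Juliet test cases), the `flag_leaky_cwes` helper, the threshold, and the sample responses are all hypothetical.

```python
import re

# Assumed tell-tale strings; tune this list against real model output.
JULIET_MARKERS = re.compile(
    r"juliet|\bSARD\b|CWE\d+_\w+__|goodG2B|goodB2G",
    re.IGNORECASE,
)


def flag_leaky_cwes(responses, min_fraction=0.2):
    """Return CWEs where at least min_fraction of responses mention a Juliet marker.

    responses maps a CWE id to a list of raw model response strings.
    """
    flagged = []
    for cwe, texts in responses.items():
        if not texts:
            continue  # no responses collected for this CWE yet
        leaks = sum(1 for text in texts if JULIET_MARKERS.search(text))
        if leaks / len(texts) >= min_fraction:
            flagged.append(cwe)
    return sorted(flagged)


# Made-up placeholder responses, only to show the shape of the input.
responses = {
    "CWE-121": [
        "This looks like a Juliet test case with a stack overflow.",
        "Stack buffer overflow in the copy loop.",
    ],
    "CWE-416": [
        "Use after free of the session object.",
        "The pointer is freed and then dereferenced.",
    ],
}

print(flag_leaky_cwes(responses, min_fraction=0.5))
```

A keyword scan like this only catches cases where the model says out loud that it recognized the source, so it would be a first-pass filter rather than proof that a CWE is clean.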

Here's the project. Hopefully this doesn't break rule 6. I am not a regular here, just looking for advice.

submitted by /u/Psychological_Meat_6