Tuesday, January 30, 2024

Evaluation of LLM Chatbots for OSINT-based Cyberthreat Awareness

Evaluation of LLM Chatbots for OSINT-based Cyberthreat Awareness

Computer Science > Cryptography and Security

Knowledge sharing about emerging threats is crucial in the rapidly advancing field of cybersecurity and forms the foundation of Cyber Threat Intelligence. In this context, Large Language Models are becoming increasingly significant in the field of cybersecurity, presenting a wide range of opportunities.

This study explores the capability of chatbots such as ChatGPT, GPT4all, Dolly,Stanford Alpaca, Alpaca-LoRA, and Falcon to identify cybersecurity-related text within Open Source Intelligence.

We assess the capabilities of existing chatbot models for Natural Language Processing tasks. We consider binary classification and Named Entity Recognition as tasks.

This study analyzes well-established data collected from Twitter, derived from previous research efforts. Regarding cybersecurity binary classification, Chatbot GPT-4 as a commercial model achieved an acceptable F1-score of 0.94, and the open-source GPT4all model achieved an F1-score of 0.90. However, concerning cybersecurity entity recognition, chatbot models have limitations and are less effective.

This study demonstrates the capability of these chatbots only for specific tasks, such as cybersecurity binary classification, while highlighting the need for further refinement in other tasks, such as Named Entity Recognition tasks.
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as: arXiv:2401.15127 [cs.CR]
  (or arXiv:2401.15127v1 [cs.CR] for this version)

Submission history

From: Samaneh Shafee [view email]
[v1] Fri, 26 Jan 2024 13:15:24 UTC (683 KB)

Summary

Here is a summary of the key points from the documents:

  • The documents present an evaluation of the capabilities of chatbots and large language models (LLMs) for natural language processing tasks in cyberthreat detection. Specifically, the research focuses on binary text classification and named entity recognition using Twitter data.
  • Several chatbot models are examined, including open source options like GPT4all, Dolly, Alpaca, and Falcon, as well as commercial versions of ChatGPT. Their performance is compared on classifying tweets as cybersecurity-related or not, and on identifying organization and product version entities.
  • For binary classification, ChatGPT and GPT4all achieve the best results, with F1 scores above 0.9. The other models range from 0.64 to 0.86 F1. ChatGPT is also most accurate for entity recognition. Overall, commercial models outperform open source, but GPT4all comes closest.
  • Challenges identified include inconsistencies in chatbot responses, requiring manual validation, and limitations in providing precise named entities without additional fine-tuning. Prompt engineering is noted as an important factor in optimizing chatbot performance.
  • The potential of LLMs and chatbots for cyberthreat detection is demonstrated, but refinement is still needed for optimal real-world application, particularly on specialized tasks like entity extraction. The study provides insights into strengths and weaknesses of different models.

F1 Scores

The F1 score is a common evaluation metric used to measure the accuracy of models for classification and information retrieval tasks. It is the harmonic mean of precision and recall:

F1 = 2 * (precision * recall) / (precision + recall)

Where:

  • Precision is the fraction of predicted positive cases that are correctly real positives. It measures how many selected items are relevant.
  • Recall is the fraction of real positive cases that are correctly predicted positive. It measures how many relevant items are selected.
  • The F1 score balances both precision and recall into a single metric. It ranges from 0 to 1, with 1 being perfect precision and recall, and 0 being the worst.

The F1 score is commonly used instead of raw accuracy when there is an uneven class distribution. It gives a more realistic measure of a model's performance by accounting for false positives and false negatives. Models with high precision but low recall, or vice versa, are penalized compared to models where both are high.

Based on the documents, here are the key F1 scores reported for the cybersecurity binary classification task:

  • ChatGPT-3.5-turbo (16k context): 0.9431
  • ChatGPT-4 (8k context): 0.9410
  • GPT4all: 0.9049
  • Dolly 2.0 (12B parameters): 0.8612
  • Falcon (40B parameters): 0.8511
  • Alpaca-LoRA (65B parameters): 0.8477
  • Stanford Alpaca (30B parameters): 0.6415

The commercial ChatGPT models achieved the highest F1 scores on this task, with GPT4all being the top performing open source model. The scores ranged from 0.6415 to 0.9431 across the different chatbot models examined. The multi-task LSTM model from previous work scored 0.9470.

For the named entity recognition task, the F1 scores were lower overall. ChatGPT-4 achieved 0.41 for organization extraction and 0.54 for product version extraction when tested on a subset of the data. On the full dataset, its F1 score was only 0.10 for extracting all entity types.

So in summary, the F1 scores indicate ChatGPT and GPT4all performed best for classification, while all models struggled more with precise entity extraction, highlighting a limitation in applying chatbots directly for this specialized NLP task.

Authors:

  •  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal
  • Previous relevant publications:
    • Ferreira et al. (2020) analyzed threat data on Twitter
    • Dionisio et al. (2019, 2020) worked on cyberthreat detection from Twitter using neural networks
    • Alves et al. (2021) presented a system to process tweets for threat awareness

Institutions:

  • LASIGE - Laboratory of Software Engineering, Faculty of Sciences, University of Lisbon

Artifacts:

  • The authors used a dataset of 31281 tweets collected and labeled by Alves et al. (2020). This dataset could potentially be requested for independent verification.
  • Code and models do not seem to be directly shared, but the approaches are described in enough detail to reimplement if needed.
  • Results are presented comprehensively including F1 scores, execution times, prompt examples, and classification outputs. This allows independent assessment.

In summary, the key artifacts are the labeled Twitter dataset and the detailed experimental results. The datasets and implementation details provide potential avenues for reproducing or extending the study if access was granted to the Twitter data.

Some other References:

  1. Daniel Iwugo, "Large Language Models and Cybersecurity – What You Should Know," freecodecamp.org
  2. cybersecurity.dev AI in Cybersecurity: The Role of Large Language Models | cybersecurity.dev

     












Bolstering Cybersecurity: How Large Language Models and Generative AI are Transforming Digital Security | NVIDIA Technical Blog

By Nicola Sessions

Identity-based attacks are on the rise, with phishing remaining the most common and second-most expensive attack vector. Some attackers are using AI to craft more convincing phishing messages and deploying bots to get around automated defenses designed to spot suspicious behavior.

At the same time, a continued increase in enterprise applications introduces challenges for IT teams who must support, secure, and manage these applications, often with no increase in staffing.

The number of connected devices continues to grow, introducing security risks due to an increase in the attack surface. This is compounded by potential vulnerabilities associated with each device.

While there are many security tools and applications available to help enterprises defend against attacks, integrating and managing a large number of tools introduces more cost, complexity, and risk.

​​Cybersecurity is among the top three challenges for CEOs, second to environmental sustainability and just ahead of tech modernization. Generative AI can be transformational for cybersecurity. It can help security analysts find the information they need to do their jobs faster, generate synthetic data to train AI models to identify risks accurately, and run what-if scenarios to better prepare for potential threats. 

Using AI to keep pace with an expanding threat landscape

Cybersecurity is a data problem, and the vast amount of data available is too large for manual screening and threat detection. This means human analysts can no longer effectively defend against the most sophisticated attacks because the speed and complexity of attacks and defenses exceed human capacity. With AI, organizations can achieve 100 percent visibility of their data and quickly discover anomalies, enabling them to detect threats faster.

Although the exponentially increasing quantity of data poses a challenge for threat detection, AI-based approaches to cyber defense require access to training data. In some cases, this isn’t readily available, because organizations don’t typically share sensitive data. With generative AI, synthetic data can help ‌address the data gap and improve cybersecurity AI defense.

One of the most effective ways of synthesizing and contextualizing data is through natural language. The advancements of large language models (LLMs) are expanding threat detection and data generation techniques that improve cybersecurity. 

This post explores three use cases showing how generative AI and LLMs improve cybersecurity and provides three examples of how AI foundation models for cybersecurity can be applied.

Copilots boost the efficiency and capabilities of security teams

Staffing shortages for cybersecurity professionals persist. Security copilots with retrieval-augmented generation (RAG) enable organizations to tap into existing knowledge bases and extend the capabilities of human analysts, making them more efficient and effective.  

Copilots learn from the behaviors of security analysts, adapt to their needs, and provide relevant insights that guide them in their daily work, all in a natural interface. Organizations are quickly discovering the value of RAG chatbots. 

By 2025, two-thirds of businesses will leverage a combination of generative AI and RAG to power domain-specific, self-service knowledge discovery, improving decision efficacy by 50%1.

In addition to not having enough cybersecurity personnel, organizations are challenged in training new and existing employees. With copilots, cybersecurity professionals can get near real-time responses and guidance on complex deployment scenarios without the need for additional training or research.

While security copilots can bring transformational benefits to an organization, they’re only useful when they can provide fast, accurate, and up-to-date information. The NVIDIA AI Chatbot with Retrieval-Augmented Generation workflow provides a great starting point. It demonstrates how to build agents and chatbots that can retrieve the most up-to-date information in real-time and provide accurate responses in natural language.  

Generative AI can dramatically improve common vulnerability defense

Patching software security issues are becoming increasingly challenging as the number of reported security flaws in the common vulnerabilities and exposures (CVEs) database hit a record high in 2022. With over 200,000 cumulative vulnerabilities reported as of the third quarter of 2023, it’s clear that a traditional approach to scanning and patching has become unmanageable. 

Organizations that deploy risk-based analysis experience less costly breaches compared to those that rely solely on CVE scoring to prioritize vulnerabilities. Using generative AI, it’s possible to improve vulnerability defense while decreasing the load on security teams.

Using the NVIDIA Morpheus LLM engine integration, NVIDIA built a pipeline to address CVE risk analysis with RAG. Security analysts can determine whether a software container includes vulnerable and exploitable components using LLMs and RAG. 

This method enabled analysts to investigate individual CVEs 4X faster, on average, and identify vulnerabilities with high accuracy so patches could be prioritized and addressed accordingly. 

A diagram of NVIDIA Morpheus LLM engine for CVE Exploitability using retrieval augmented generation.
Figure 1. CVE exploitability using Morpheus LLM engine supporting model-generated RAG tasks and multiple loops

Foundation models for cybersecurity

While pretrained models are useful for many applications, there are times when it’s beneficial to train a custom model from scratch. This is helpful when there’s a specific domain with a unique vocabulary or the content has properties that do not conform to traditional language paradigms and structures. 

In cybersecurity, this is observed with certain types of raw logs. Think about a book and how words form sentences, sentences form paragraphs, and paragraphs form chapters. There’s an inherent structure that is part of the language model. Contrast that to data contained in a format like JSON-lines or CEF. Proximity of the data keys and values doesn’t have the same meaning. 

Using custom foundation models presents multiple opportunities.

  • Addressing the data gap: while making better use of the influx of data can lead to improved cybersecurity, the quality of the data matters. When there is a lack of available training data, the accuracy of detecting threats is compromised. Generative AI can help ‌address the data gap with synthetic data generation, or by using large models to generate data to train smaller models.
  • Performing “what if” scenarios: novel threats are challenging to defend against without data sets to build the defenses. Generative AI can be used for attack simulations and to perform “what if” scenarios—to test against attack patterns that haven’t yet been experienced. This dynamic model training, based on evolving threats and changing patterns in data can help to improve overall security.
  • Feed downstream anomaly detectors: use large models to generate data that train downstream, lightweight models used for threat detection, which can reduce infrastructure costs while keeping the same level of accuracy.

NVIDIA performed many experiments and trained several cybersecurity-specific foundation models, including one based on GPT-2 style models referenced as CyberGPT. One of those is a model that is trained on identity data (including application logs like Azure AD). With this model, one can generate highly realistic synthetic data that addresses a data gap and can perform “what if” scenarios. 

Figure 2 shows the Rogue2 F1 scores for CyberGPT models of various sizes, with each instance achieving around 80% accuracy. This means that 8 out of 10 logs generated are virtually indistinguishable from logs generated by real network users.

A bar chart showing 80% accuracy for Rogue2 F1 scores of CyberGPT models generated compared with authentic logs.
Figure 2. Accuracy and realism scores of logs generated by CyberGPT models

As for training times, a supercomputer isn’t necessary to realize quality results. In testing, training times were as low as 12 GPU hours for a GPT-2-small model with character-level tokenization. This model is trained on 2.3M rows of over 100 user logs with 1,000 iterations. This model was trained on multiple types of data, including Azure, SharePoint, Confluence, and Jira.

Experiments were also run with tokenizers–primarily character-level tokenizers, off-the-shelf byte pair encoding (BPE) tokenizers, and custom-trained tokenizers. While there are benefits and drawbacks to each, the best performance comes as a result of training custom tokenizers. This not only enables more efficient use of resources due to the custom vocabulary, but it results in reduced tokenization errors and can handle log-specific syntax.

While these results reflect experiments with language models, the same tests with LLMs achieve similar results.

Synthetic data generation provides 100% detection of spear phishing e-mails

Spear phishing e-mails are highly targeted, and therefore, very convincing. The only real difference between a spear phishing (and, in general, any effective phishing campaign) and a benign e-mail is the intent of the sender. This makes spear phishing challenging to defend against with AI because there is a lack of available training data. 

To explore the potential of synthetic data generation in enhancing spear phishing e-mail detection, a pipeline was constructed using NVIDIA Morpheus.

With off-the-shelf models, the spear phishing detection pipeline missed 16% (about 600) of malicious e-mails. The uncaught malicious e-mails were then used to create a new synthetic dataset. A new intent model was learned from the synthetically generated e-mails, and integrated into our spear phishing detection pipeline. The addition of this new intent model feature in the detection pipeline resulted in 100% detection of spear phishing e-mails trained solely on synthetic e-mails. 

The NVIDIA spear phishing detection AI workflow provides an example of how to build this solution using NVIDIA Morpheus.

A diagram showing NVIDIA Morpheus spear-phishing detection AI pipeline using generative AI.
Figure 3. Spear phishing detection pipeline built using synthetically generated spear phishing e-mails that correspond to specific behavioral intents 

A comprehensive approach to enterprise security

The NVIDIA AI platform is uniquely positioned to help address these challenges–building in security at multiple levels. At the hardware infrastructure level, and beyond the data center perimeter to the edge of every server, while also providing tools that help to secure your data with AI. 

Learn more

Watch the session from Bartley Richardson, head of cybersecurity engineering at NVIDIA, to see demonstrations of the use cases illustrated in this post. Learn about integrating language models and cybersecurity featured at NVIDIA LLM Developer Day.

Check out the November 2023 release of NVIDIA Morpheus to access the new LLM engine integration feature, and get started with accelerated AI for cybersecurity. 

Find out how NVIDIA NeMo provides an easy way to get started with building, customizing, and deploying generative AI models. 

NVIDIA Morpheus and NeMo are included with NVIDIA AI Enterprise, the enterprise-grade software that powers the NVIDIA AI platform.

  1.  IDC FutureScape: Worldwide Artificial Intelligence and Automation 2024 Predictions, #AP50341323, October 2023 ↩︎

No comments:

Post a Comment

Novel AI System Achieves 90% Accuracy in Detecting Drone Jamming Attacks

Loss convergence analysis using test data under LoS and NLoS conditions     Novel AI System Achieves 90% Accuracy in Detecting Drone Jamming...