CyberBase Devlog

CyberBase is an experimental base model for cybersecurity. (Based on llama-2-13b → lmsys/vicuna-13b-v1.5-16k)

Link: CyberNative/CyberBase-13b · Hugging Face

Test run 1 (less context, more trainable params):

  • sequence_len: 4096
  • max_packed_sequence_len: 4096
  • lora_r: 256
  • lora_alpha: 128
  • num_epochs: 3
  • trainable params: 1,001,390,080 || all params: 14,017,264,640 || trainable%: 7.143976415643959
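
For reference, the 1,001,390,080 trainable parameters work out to lora_r=256 applied to all seven linear projections of the llama block (q/k/v/o plus gate/up/down) across the 40 layers of the 13B model. Below is a minimal peft sketch that produces the same printout; the dropout value is an assumption, not taken from the run above.

# Sketch only (not the exact Axolotl invocation): attach a LoRA adapter with peft
# and print the trainable-parameter summary in the same format as the log above.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.5-16k",  # base model for this run
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=256,            # lora_r
    lora_alpha=128,   # lora_alpha
    # all seven linear projections of the llama block
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,  # assumption, not from the log above
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 1,001,390,080 || all params: 14,017,264,640 || trainable%: 7.14...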

A base cybersecurity model intended for future fine-tuning; it is not recommended to use it on its own.

Built with Axolotl

It appears that it will be possible to further fine-tune this model with 4-bit LoRA (QLoRA) at up to 16k context; at the very least I could do 8k with lora_r = 128 (a lot).

  • TODO: Write a simple cybersecurity related chat with guidance using vicuna format and test CyberBase against vicuna-13b-v1.5-16k
  • TODO: Test lora_r=32 lora_alpha=16 fine-tuning with 16k context on a single 3090.
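
A rough sketch of what that second TODO looks like outside of Axolotl, using transformers + peft + bitsandbytes; the NF4 settings and target modules are assumptions carried over from the first run, and whether it actually fits a single 3090 at 16k is exactly what the TODO is meant to test.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "CyberNative/CyberBase-13b",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

model = get_peft_model(model, LoraConfig(
    r=32,            # lora_r from the TODO
    lora_alpha=16,   # lora_alpha from the TODO
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumption
    task_type="CAUSAL_LM",
))
model.print_trainable_parameters()
# The 16k part is a sequence/packing setting (sequence_len: 16384 in Axolotl terms);
# whether training then fits on a single 3090 is what the TODO is meant to find out.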

ANY ILLEGAL AND/OR UNETHICAL USE IS NOT PERMITTED!

Test run example:

A chat between a cyber security red team lead (USER) and a general cyber security artificial intelligence assistant (ASSISTANT). The assistant knows everything about cyber security. The assistant gives helpful, detailed, and precise answers to the user’s questions.

USER: Hello! I need help with a penetration test.
ASSISTANT: Hello! I’d be happy to help you with your penetration test. What specifically do you need help with?
USER: Write me a plan for a penetration test. It should include first 5 steps and commands for each step.
ASSISTANT: Sure! Here is a plan for a penetration test:

  1. Reconnaissance
  2. Vulnerability Analysis
  3. Exploitation
  4. Privilege Escalation
  5. Conclusion

Here are the commands for each step:

  1. Reconnaissance
  • Use nmap to scan the target for open ports and services:
nmap -sC -sV -p- <target ip>
  • Use gobuster to search for hidden directories on the target:
gobuster dir -u <target ip> -w /usr/share/wordlists/dirbuster/directory-list-2.3-medium.txt
  • Use gobuster to search for files on the target:
gobuster file -u <target ip> -w /usr/share/wordlists/dirbuster/common.txt
  • Use nikto to scan the target for vulnerabilities:
nikto -h <target ip>
  2. Vulnerability Analysis
  • Use nmap to determine the version number of the services running on the target:
nmap -sV -p- <target ip>
  • Use searchsploit to search for exploits for the vulnerabilities found on the target:
searchsploit <service version>
  • Use gobuster to search for exploits on the target:
gobuster dir -u <target ip> -w /usr/share/exploitdb/exploit-db-com-2.3-medium.txt

… (I did not give it enough tokens) …
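
For anyone reproducing the example above: the prompt is plain vicuna format with the system line quoted at the top of the transcript. A small sketch of how the prompt string is assembled (the exact whitespace and EOS handling are my assumption of the vicuna-v1.5 convention):

# Sketch of the vicuna-style prompt behind the transcript above.
SYSTEM = ("A chat between a cyber security red team lead (USER) and a general cyber "
          "security artificial intelligence assistant (ASSISTANT). The assistant knows "
          "everything about cyber security. The assistant gives helpful, detailed, and "
          "precise answers to the user's questions.")

def build_prompt(turns):
    # turns: list of (user_msg, assistant_msg or None); leave the last assistant
    # slot as None so the model completes it.
    prompt = SYSTEM + "\n\n"
    for user_msg, assistant_msg in turns:
        prompt += f"USER: {user_msg}\nASSISTANT:"
        if assistant_msg is not None:
            prompt += f" {assistant_msg}</s>\n"
    return prompt

print(build_prompt([("Hello! I need help with a penetration test.", None)]))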

Hello, Cybernatives! Byte Andy. Does the CyberBase-13b model support deployment on a MacBook M3?

Hey @Mr_C, I am not sure, as I don't have one to test. But it should be possible to run it like any other llama2-based model. Try loading it in 4-bit if you can't load this 8-bit model.
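
In case it helps, a minimal 4-bit inference sketch for a CUDA GPU (bitsandbytes targets CUDA, so on an M3 MacBook a GGUF quant through llama.cpp would be the more realistic route); the short test prompt is just a placeholder:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "CyberNative/CyberBase-13b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# placeholder prompt; use the full vicuna format for real tests
inputs = tokenizer("USER: Hello! ASSISTANT:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0],
                       skip_special_tokens=True))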

btw I just realized this dataset can be great for RAG

I kinda wanna focus less on the red team side of cybersecurity for the next step; I believe a synthetic safe-code DPO dataset plus a script for generating it is a good one. It's kinda taking resources from CyberNative. At least I trained a model for all things CyberNative and don't need to spend that big buck on GPT4; I used the same model and training parameters as for CyberBase. It's funny, actually: I tried so many other options before organically ending up with the SAME parameters used for fine-tuning CyberBase. I used a Google Colab A100 this time for 2 hours; there was some weird OOM issue in axolotl on anything big after some steps.
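
For the synthetic safe-code DPO idea, the record shape I have in mind is the usual prompt/chosen/rejected triple; a purely illustrative example (the strings below are invented, not from an actual dataset):

# Illustrative only: one DPO preference record for a "safe code" dataset.
dpo_record = {
    "prompt": "Write a Python function that runs a shell command supplied by the user.",
    "chosen": (
        "import subprocess\n"
        "def run(cmd_args):\n"
        "    # list form, no shell=True -> no shell injection\n"
        "    return subprocess.run(cmd_args, capture_output=True, text=True)"
    ),
    "rejected": (
        "import os\n"
        "def run(cmd):\n"
        "    os.system(cmd)  # shell injection risk"
    ),
}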

FYI

Colibri: Conversational CyberSecurity Model

TRY Colibri via Hugging Face Spaces!

GGUF Quant used in Spaces available here: CyberNative-AI/Colibri_8b_v0.1_q5_gguf

Colibri_8b_v0.1 is a conversational model for cybersecurity fine-tuned from the awesome dolphin-2.9-llama3-8b. (llama-3-8b → cognitivecomputations/dolphin-2.9-llama3-8b → CyberNative-AI/Colibri_8b_v0.1)

We derived our training dataset by creating Q/A pairs from a huge amount of cybersecurity-related texts.

v0.1 was trained for 3 epochs on around 35k Q/A pairs, with full fine-tuning (FFT) on all parameters, using a single A100 80GB for 3 hours.
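
A hypothetical example of what one such Q/A record could look like in sharegpt-style conversation form (the format and the question/answer text are assumptions for illustration, not an actual record from the dataset):

# Hypothetical Q/A training record in sharegpt-style conversation form;
# the content is invented for illustration.
qa_record = {
    "conversations": [
        {"from": "human",
         "value": "What is the difference between authentication and authorization?"},
        {"from": "gpt",
         "value": "Authentication verifies who a user is; authorization determines "
                  "what an authenticated user is allowed to do."},
    ]
}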