DarkBERT: An AI trained on the Darkweb

05/24/2023 Darknet News

Researchers from the Korea Advanced Institute of Science and Technology (KAIST), in collaboration with data intelligence organization S2W, have unveiled DarkBERT, a language model pretrained on data sourced from the dark web.

Rather than building a chatbot like ChatGPT or Bard, the project aims to create a tool that analyzes data sets and answers specific queries. DarkBERT is meant to test whether training on dark web data helps AI tools better comprehend the language used in those settings, which could make it a valuable aid for cybersecurity professionals and law enforcement.

To optimize how DarkBERT adapts to the language used on the dark web, the research team built a large-scale database by crawling the Tor network. The team also applied deduplication, data filtering, and pre-processing in an effort to allay ethical concerns associated with dark web content, which often contains sensitive information.

The model was fed two sets of data over 16 days. In the pre-processed set, information such as the names of victim organizations, details on leaked data, threat statements, and illegal images had been redacted. Over a thousand pages of this data set were categorized as adult entertainment.
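The article does not describe how the team implemented these steps, so the following is only a minimal Python sketch of what deduplication, filtering, and redaction of crawled pages could look like. The function names, the regular expressions, the `[EMAIL]` and `[ONION]` placeholder tokens, and the `min_words` threshold are all assumptions made for illustration, not details from the DarkBERT project.

```python
import hashlib
import re

# Hypothetical patterns and placeholder tokens; the real pipeline is not
# described in the article.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ONION_RE = re.compile(r"\b[a-z2-7]{16,56}\.onion\b")


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def deduplicate(pages):
    """Keep the first occurrence of each page, comparing normalized hashes."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(normalize(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def filter_short(pages, min_words=5):
    """Drop pages with too few words to be useful (threshold is an assumption)."""
    return [p for p in pages if len(p.split()) >= min_words]


def redact(text):
    """Replace sensitive-looking identifiers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = ONION_RE.sub("[ONION]", text)
    return text


def preprocess(pages, min_words=5):
    """Run deduplication, filtering, and redaction in sequence."""
    return [redact(p) for p in filter_short(deduplicate(pages), min_words)]
```

Calling `preprocess` on a list of crawled page texts returns the surviving pages with matching identifiers masked. A real pipeline would need far broader redaction rules, for example for organization names or leaked records, than these two patterns cover.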

Owing to the potentially risky nature of dark web materials, DarkBERT will not be available to the public anytime soon. However, requests to use the model for academic purposes can currently be made.