New AI DarkBERT is trained on the Dark Web

❘ Published: 2023-05-17T11:48:35

darkbert represented by an ai abstract image of a brain colored purple and black

Firefly AI

Researchers have created a new AI called DarkBERT, trained almost exclusively on the Dark Web. While not intended for malicious use, it has faced a lot of seedy sites during its training.

The latest AI to be developed has a unique twist on its training. DarkBERT is built upon the BERT framework, developed by Google. Rather than the chatty capabilities of Google Bard, BERT is used to analyze and produce answers based on a particular data set.

Researchers have created DarkBERT to help better swift through the dark web in hopes of bettering cybersecurity around it. Feeding it a mass of data over the course of nearly 16 days across two sets.

One was “raw” – an unedited quantity of data – and the other, “preprocessed”, with certain aspects of what can be found on the dark web edited out. This includes things like “victim organization name, descriptions of leaked data, and threat statements with sample data”.

They also ensured that images were ejected, in case of illicit and illegal images:

“… our automated web crawler takes the approach of removing any non-text media and only stores raw text data. By doing so, we do not expose ourselves to any sensitive media that is potentially illegal.”

DarkBERT AI has seen the depths of the dark web

The paper goes into detail on how much data they fed DarkBERT, including a table that details every site and category it was filed under. Unsurprisingly, over 1000 pages were filed under adult entertainment.

Most of the research was done by crawling with Tor, the most popular browser for accessing the deep or dark web. As these websites aren’t on the “surface web”, you require the browser to access “onion links”. A vast majority – as also pointed out in the research paper – is now error codes or useless pages with minimal information on them.

DarkBERT has no current plans to release to the public, with a heavy emphasis on the research that the data set won’t be released to the public. A request can be made for academic purposes, due to the nature of the dark web’s materials, however.