This is how well AI chatbots trained in IT security work
The providers of large language models promise a lot. In collaboration with security experts, students have tested what a chatbot can really achieve.
(Image: VH-studio / Shutterstock.com)
- Michel Kellner
The use of AI chatbots is finding its way into many companies, but at the same time there are barriers to the use of ChatGPT and Co. Because IT service providers handle sensitive information, AI systems must always meet strict data protection requirements, especially when dealing with customer data. Outsourcing partners in the financial sector operate in regulated markets, where the use of services such as ChatGPT could even draw the attention of banking regulators. IT specialists work with sensitive social data when providing services in the healthcare sector. In fact, almost every activity requires the responsible and traceable handling of confidential data; after all, security specialists should not open up any new attack surfaces.
With an inadequate contractual basis and less than transparent protective measures, the unrestricted use of ChatGPT and comparable services is therefore off-limits for IT system houses, managed service providers and consulting firms - how these tools process, store and reuse the entries remains hidden. Yet an AI chatbot could simplify many work steps in IT system houses: the systems can develop configurations and code, create concepts and help with research and analysis of customer problems.
To examine the real added value of chatbots, students at Weserbergland University of Applied Sciences (HSW) investigated the development of a customized chatbot as part of a research cooperation with AirITSystems, a security-focused system house from Langenhagen.
Requirements
The research project with HSW aimed to implement the interaction between users and the AI system using an advanced large language model (LLM) on controllable on-premises technology. Beyond a chatbot prototype, the focus was specifically on the question of how much an AI chatbot interface can really help IT specialists in their day-to-day work. An intuitive front end needed to be equally accessible to security consultants, system specialists and analysts at the Security Operations Center.
The aim was to evaluate which model is particularly suitable for IT security and IT-related issues. The students also wanted to investigate the extent to which the LLM can be expanded for specific subject areas, and whether it is possible to integrate the company's own data records. Due to the security requirements, only an encapsulated instance that can be operated in the company's own system architecture was considered.
Model selection and system structure
The first test runs checked several models for the linguistic quality of their responses. Two models stood out in particular: Mistral-7B-v0.1 and Llama-2-7B-Chat-GPTQ. In the end, the Mistral model won out, as it clearly outperformed the Llama 2 models in the project's own test runs in reasoning, mathematics and code generation. The tests ran in an Anaconda environment with benchmark questions from everyday system house work, measuring output time and response length. The students evaluated the output qualitatively.
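The article does not include the students' test harness. A minimal sketch of such a benchmark run, assuming the Hugging Face Transformers library and placeholder prompts, could look like this:

```python
# Benchmark sketch (not the students' actual harness): measures output time
# and response length for prompts from everyday system house work. Model
# names are real Hugging Face IDs; loading the GPTQ model additionally needs
# the optimum/auto-gptq packages, device_map="auto" needs accelerate.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

MODELS = ["mistralai/Mistral-7B-v0.1", "TheBloke/Llama-2-7B-Chat-GPTQ"]
PROMPTS = [  # placeholder benchmark questions
    "Explain the difference between an IDS and an IPS.",
    "Write an iptables rule that blocks inbound traffic on port 23.",
]

for name in MODELS:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
    for prompt in PROMPTS:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        start = time.perf_counter()
        output = model.generate(**inputs, max_new_tokens=256)
        elapsed = time.perf_counter() - start
        answer = tokenizer.decode(output[0], skip_special_tokens=True)
        # Time and length were the quantitative criteria; the linguistic
        # quality of the answers was rated manually afterwards.
        print(f"{name}: {elapsed:.1f}s, {len(answer.split())} words")
```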
One requirement for the LLM was a balanced ratio between the token counts of input and output. To ensure efficient processing of the questions, neither input nor output could be too long. The chatbot's answers in particular showed that overly short answers were more likely to contain misinformation because information was missing. However, the answers should not be too long either: longer token sequences require more computing power, and more complex linguistic contexts risk less accurate results. Exceeding the token limit in turn entails the risk of incomplete or incoherent answers, as the LLM may miss important context. Here, too, the Mistral model was more convincing than its competitors.
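How such a token budget is enforced was not part of the published results. A minimal sketch, assuming the model's own tokenizer and placeholder limits, could truncate overly long inputs so that question and answer together stay within the context window:

```python
# Sketch of a simple token budget; the limits are assumptions, not the
# project's documented settings.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

CONTEXT_WINDOW = 4096    # assumption; depends on the model configuration
MAX_NEW_TOKENS = 512     # cap on the answer length passed to generate()
MAX_INPUT_TOKENS = CONTEXT_WINDOW - MAX_NEW_TOKENS

def fit_prompt(prompt: str) -> str:
    """Truncate the prompt so that prompt plus answer fit the context window."""
    ids = tokenizer(prompt, truncation=True, max_length=MAX_INPUT_TOKENS)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)
```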
Developing a user interface
The group of budding business IT specialists developed various front ends for the chatbot using the JavaScript libraries Vue.js and React. The good user experience of the developed human-machine interface greatly supported the acceptance of the system in the later test phase.
(Image: Dark mode was a much-requested feature of the user interface. From the developers' point of view, intuitive usability was the priority.)
In the backend, the first virtual machine, with 16 gigabytes of RAM, initially provided a sufficient service. During testing it quickly became apparent that the CPU was not up to the computing workload and that a GPU would have to be used; in the end, several Nvidia M400 graphics cards took over. The server ran Debian 11 Linux in an Azure environment. The backend consisted of a Python program that continuously listened locally as a web API for GET requests from the frontend and forwarded them to the model. The frontend requests landed in an asynchronous task queue, which the LLM processed one after the other.
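The source does not name the web framework. A minimal sketch of the described pattern - a local web API that accepts GET requests and feeds them into an asynchronous queue that the model works through sequentially - could look like this, here assuming FastAPI:

```python
# Sketch of the described backend pattern; the framework choice (FastAPI) and
# the model call are assumptions. The source only mentions a Python web API
# feeding an asynchronous task queue that the LLM processes one by one.
import asyncio
from fastapi import FastAPI

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue()

def run_llm(prompt: str) -> str:
    # Placeholder for the actual (blocking) model call on the GPU.
    return f"Answer to: {prompt}"

async def worker():
    # A single worker serializes the requests: one inference at a time.
    loop = asyncio.get_running_loop()
    while True:
        prompt, future = await queue.get()
        result = await loop.run_in_executor(None, run_llm, prompt)
        future.set_result(result)
        queue.task_done()

@app.on_event("startup")
async def start_worker():
    asyncio.create_task(worker())

@app.get("/ask")
async def ask(q: str):
    # Each GET request is queued; the caller waits until it is its turn.
    future = asyncio.get_running_loop().create_future()
    await queue.put((q, future))
    return {"answer": await future}
```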
After the necessary system hardening, the introduction of encryption for data in transit (SSL/TLS) and strong authentication procedures, the prototype was ready for testing in everyday work.
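The exact hardening steps are not described. As one common option (an assumption, not the project's documented setup), the ASGI server itself can terminate TLS:

```python
# Sketch: TLS termination directly in uvicorn; the certificate paths and the
# module name "backend" are placeholders. In practice, a reverse proxy often
# terminates TLS in front of the application instead.
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "backend:app",
        host="127.0.0.1",  # listen locally only, as described in the article
        port=8443,
        ssl_keyfile="/etc/ssl/private/chatbot.key",
        ssl_certfile="/etc/ssl/certs/chatbot.pem",
    )
```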
Use and field experience
The feedback on user-friendliness in the IT security specialists' day-to-day work was largely positive. Every interview participant reported that the front end was easy and intuitive to use. When it came to real benefits, however, the feedback was less clear-cut. The experts see added value above all for exercises, researching technical definitions and formulating texts for concept papers. For these tasks, the practitioners estimate a possible time saving of up to 40 percent, which makes the use of chatbots appear attractive.
The greatest need for improvement in the prototype lay in more concise and shorter answers: the responses noticeably digressed from the topic. The required database connection, intended to override the model output with the company's own answers on certain topics, also proved to be a challenge. For this connection the students had to completely retrain the model, which took 18 hours in the test alone - and had to be repeated every time the database changed. As user-defined answers are generally volatile data, the time-consuming fine-tuning on the company's own content would currently still be a showstopper. Retrieval Augmented Generation (RAG) was not tested as an alternative.
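RAG would retrieve matching records at query time and inject them into the prompt instead of retraining the model. The project did not evaluate this, but a minimal sketch, assuming the sentence-transformers library and placeholder documents, illustrates the idea:

```python
# Sketch of the untested RAG alternative: company records are embedded once,
# and the most relevant entries are prepended to the prompt at query time.
# Library choice and document contents are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
documents = [  # placeholder company records
    "Firewall change requests must be approved by the SOC lead.",
    "Customer VPN tunnels use IKEv2 with certificate authentication.",
]
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

def build_prompt(question: str, k: int = 2) -> str:
    """Prepend the k most similar records to the question."""
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q_vec          # cosine similarity (normalized vectors)
    best = np.argsort(scores)[::-1][:k]
    context = "\n".join(documents[i] for i in best)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Because the records are only embedded rather than baked into the model weights, a changed database would require re-encoding a few documents instead of an 18-hour training run.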
Conclusion: what generative AI is good for in practice
The prototype was successfully introduced, and user acceptance was immediate - the effort invested in a good front end paid off. Statements about real added value or even quantifiable savings are harder to make; depending on the task, the chatbot's sometimes less-than-concise output was helpful to varying degrees. The conciseness of the answers improved again after the initial feedback.
The system was usable almost immediately and thus offered users an alternative to the unauthorized use of public AI products from third-party providers. A simple ban on the use of ChatGPT will not get far on its own - in-house applications, which a system house can implement with manageable effort, flank the AI strategy.
In addition to the author, the following prospective business IT specialists from Weserbergland University of Applied Sciences were involved in the project: Lars Wendt, Max Wilhelm Berberich, Jonas Gieschen, Sebastian Evers, Maik Scheidemantel, Damian Bender, Adrian Michal Romanik, Henri Manderla.
(mki)