Pros and Cons of Multi-Model DBs to Store Threat Intelligence Data

At SEKOIA, we offer our clients a platform that detects and responds to attacks targeting their infrastructure: SEKOIA.IO. To accomplish this mission, we have invested heavily in research and development, with the operationalization of Cyber Threat Intelligence (CTI) as the cornerstone. This strategy led us to model threats with the STIX language, which increases our capacity to format, enrich and exchange threat intelligence at large scale.

STIX is a standard for expressing information about computer threats in a structured and unambiguous way. Based on JSON, it has the potential to allow automatic information exchange between the many tools used to ensure the security of an organization.

STIX defines two categories of objects: STIX Domain Objects (SDO) and STIX Relationship Objects (SRO). As soon as a piece of threat intelligence data enters our threat intelligence system, it is converted to the STIX format. This enables our security teams to match indicators seen on their clients' networks. It also allows us to share threat intelligence with other companies.
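To make the two categories concrete, here is a minimal sketch of two SDOs and one SRO written as plain Python dictionaries and serialized to JSON. It assumes STIX 2.1 property names; every identifier, name and pattern below is made up for illustration.

```python
import json

# All identifiers and values are hypothetical, for illustration only.
TIMESTAMP = "2020-01-01T00:00:00.000Z"

# SDO: a threat actor.
threat_actor = {
    "type": "threat-actor",
    "spec_version": "2.1",
    "id": "threat-actor--11111111-1111-4111-8111-111111111111",
    "created": TIMESTAMP,
    "modified": TIMESTAMP,
    "name": "Example Actor",
}

# SDO: an indicator carrying a detection pattern.
indicator = {
    "type": "indicator",
    "spec_version": "2.1",
    "id": "indicator--22222222-2222-4222-8222-222222222222",
    "created": TIMESTAMP,
    "modified": TIMESTAMP,
    "pattern": "[domain-name:value = 'malicious.example']",
    "pattern_type": "stix",
    "valid_from": TIMESTAMP,
}

# SRO: the edge that links the indicator to the threat actor.
relationship = {
    "type": "relationship",
    "spec_version": "2.1",
    "id": "relationship--33333333-3333-4333-8333-333333333333",
    "created": TIMESTAMP,
    "modified": TIMESTAMP,
    "relationship_type": "indicates",
    "source_ref": indicator["id"],
    "target_ref": threat_actor["id"],
}

bundle = {
    "type": "bundle",
    "id": "bundle--44444444-4444-4444-8444-444444444444",
    "objects": [threat_actor, indicator, relationship],
}

print(json.dumps(bundle, indent=2))
```

The SRO only holds the identifiers of the two objects it connects, which is why the storage question splits naturally into "how to store documents" and "how to store edges".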

It also helps normalize the data we receive from various sources, including OSINT, partners and premium services. Using such a complex standard has advantages, but it also raises a few critical issues, starting with finding the easiest and most efficient way to store STIX data. From our experience, these are the pros and cons of storing STIX data in relational databases, document-oriented databases and graph databases.

Relational Databases

Storing relationships in a relational database is a natural fit, for example using a join table. However, storing the SDOs is problematic for several reasons. The STIX standard describes multiple types of objects, and each has its own set of attributes.

The most common way to solve this issue would be to have one base table for the common attributes, and additional tables to store the type-specific fields. Inserting data in this kind of schema is not efficient, as it requires writing to two tables when adding a new object.

Querying is even more challenging. For example, filtering on an attribute that is not common to all the objects (so not in the base table) requires querying all the tables with this attribute (many types of objects may have this attribute) and then joining them with the base table to get the remaining fields.
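The base-table layout and its querying cost can be sketched with SQLite. The table names, columns and sample rows below are assumptions chosen for illustration, not SEKOIA's actual schema; `aliases` stands in for an attribute shared by some object types but not all, and is stored as plain text to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stix_object (id TEXT PRIMARY KEY, type TEXT, name TEXT);
    CREATE TABLE malware (id TEXT PRIMARY KEY, aliases TEXT);
    CREATE TABLE threat_actor (id TEXT PRIMARY KEY, aliases TEXT);
    """
)

# Hypothetical mapping from STIX type to its type-specific table.
TYPE_TABLES = {"malware": "malware", "threat-actor": "threat_actor"}


def insert_object(obj_id, obj_type, name, aliases):
    """Adding one object needs two writes: base table, then type table."""
    table = TYPE_TABLES[obj_type]  # whitelisted name, safe to interpolate
    with conn:
        conn.execute(
            "INSERT INTO stix_object VALUES (?, ?, ?)", (obj_id, obj_type, name)
        )
        conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (obj_id, aliases))


def find_by_alias(alias):
    """Filter on a non-common attribute: scan every table that has it,
    then join back to the base table for the remaining fields."""
    query = """
        SELECT b.id, b.type, b.name, s.aliases
        FROM (
            SELECT id, aliases FROM malware
            UNION ALL
            SELECT id, aliases FROM threat_actor
        ) AS s
        JOIN stix_object AS b ON b.id = s.id
        WHERE s.aliases = ?
        ORDER BY b.id
    """
    return conn.execute(query, (alias,)).fetchall()


insert_object("malware--m1", "malware", "ExampleRAT", "shared-alias")
insert_object("threat-actor--t1", "threat-actor", "Example Actor", "shared-alias")
insert_object("malware--m2", "malware", "OtherRAT", "other-alias")

print(find_by_alias("shared-alias"))
```

Every new object type that carries the attribute adds another branch to the `UNION ALL`, so the query has to be maintained alongside the schema.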

Another key issue is performing graph queries in RDB engines. Retrieving level-1 relationships is fairly straightforward; following relationships up to level 3 is far more complicated. Consider, for example, retrieving all of the indicators used by a specific threat actor.

On one side, a threat actor may have relationships with campaigns, attack patterns, malware, or directly with indicators. On the other side, the indicators may be directly linked to a threat actor, but they could also be linked to campaigns, malware, attack patterns and so on. Getting all the indicators linked to a threat actor means getting all the indicators directly linked to it, but also getting the ones linked through other objects.

Creating this query in SQL is very complicated: you would need to get all the relationships for the threat actor, then all the relationships of each related object, and so on until you have collected all the indicators.
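To give a sense of what such a query involves, here is a sketch that uses a recursive common table expression in SQLite. It is one possible approach under simplifying assumptions, not the query we ran in production: the `relationship` table, the shortened identifiers and the sample edges are all hypothetical, and edges are followed in both directions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relationship (source_ref TEXT, target_ref TEXT)")
conn.executemany(
    "INSERT INTO relationship VALUES (?, ?)",
    [
        ("indicator--a", "threat-actor--x"),  # direct link
        ("campaign--c", "threat-actor--x"),
        ("indicator--b", "campaign--c"),      # reached through a campaign
        ("threat-actor--x", "malware--m"),
        ("indicator--d", "malware--m"),       # reached through malware
        ("indicator--z", "malware--other"),   # unrelated to the actor
    ],
)

TRAVERSAL = """
WITH RECURSIVE
  edge(a, b) AS (
    SELECT source_ref, target_ref FROM relationship
    UNION
    SELECT target_ref, source_ref FROM relationship
  ),
  reachable(id, depth) AS (
    SELECT ?, 0
    UNION
    SELECT edge.b, reachable.depth + 1
    FROM reachable JOIN edge ON edge.a = reachable.id
    WHERE reachable.depth < ?
  )
SELECT DISTINCT id FROM reachable ORDER BY id
"""


def indicators_for(actor_id, max_depth):
    """Return indicator ids reachable from actor_id within max_depth hops."""
    rows = conn.execute(TRAVERSAL, (actor_id, max_depth)).fetchall()
    return [row[0] for row in rows if row[0].startswith("indicator--")]


print(indicators_for("threat-actor--x", 3))
```

The depth bound is what keeps the recursion finite on a cyclic graph. Even in this reduced form, the query says nothing about relationship types or the SDO attributes, which would require further joins against the object tables.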

Document-oriented Databases

Using a document-oriented database solves many of these issues. All the objects are stored in the same collection, so a single write to the database is enough. A request based on a field value needs no join operation and is pretty simple: the documents having this field are evaluated and the others are ignored.
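The filtering behaviour described above can be mimicked with an in-memory list of dictionaries. This is only a toy model of a collection, with made-up documents and a hypothetical `find` helper, and it does not reflect the API of any particular database.

```python
# One collection holds every object type, each with its own fields.
collection = [
    {"id": "malware--m1", "type": "malware", "name": "ExampleRAT", "is_family": True},
    {"id": "indicator--i1", "type": "indicator", "pattern_type": "stix"},
    {"id": "threat-actor--t1", "type": "threat-actor", "name": "Example Actor"},
]

_MISSING = object()  # sentinel so that a stored None is still comparable


def find(documents, field, value):
    """Return the documents that have `field` set to `value`.

    Documents lacking the field are simply skipped; no join is needed.
    """
    return [doc for doc in documents if doc.get(field, _MISSING) == value]


print([doc["id"] for doc in find(collection, "pattern_type", "stix")])
```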

The problem in this case comes with the relationships. Document-oriented databases support relationships through two mechanisms: embedded documents or document references.

Embedded documents make queries easy, as no joins are required, but this breaks down when you need to update the embedded document. If the embedded document is present in multiple main documents, you will need to update every document that embeds it.

Alternatively, with references between related documents, updating the related content is easy, until you need to remove the referenced document. You would then have to go through all the documents, find the ones holding a reference to it and remove that reference.
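The two trade-offs can be illustrated with plain dictionaries. The field names (`malware`, `malware_refs`) and the documents are assumptions made for this sketch; the point is only to show how many documents each operation has to touch.

```python
import copy

malware = {"id": "malware--m", "name": "ExampleRAT"}

# Option 1: each indicator embeds its own copy of the malware document.
embedded_docs = [
    {"id": "indicator--a", "malware": copy.deepcopy(malware)},
    {"id": "indicator--b", "malware": copy.deepcopy(malware)},
    {"id": "indicator--c"},
]


def rename_embedded(documents, malware_id, new_name):
    """Updating embedded content means rewriting every embedding document."""
    touched = 0
    for doc in documents:
        embedded = doc.get("malware")
        if embedded is not None and embedded["id"] == malware_id:
            embedded["name"] = new_name
            touched += 1
    return touched


# Option 2: indicators only keep the identifier of the malware document.
referenced_docs = {
    "malware--m": copy.deepcopy(malware),
    "indicator--a": {"id": "indicator--a", "malware_refs": ["malware--m"]},
    "indicator--b": {"id": "indicator--b", "malware_refs": ["malware--m"]},
    "indicator--c": {"id": "indicator--c", "malware_refs": []},
}


def delete_referenced(documents, target_id):
    """Deleting a referenced document means scanning for dangling references."""
    documents.pop(target_id, None)
    cleaned = 0
    for doc in documents.values():
        refs = doc.get("malware_refs", [])
        if target_id in refs:
            doc["malware_refs"] = [ref for ref in refs if ref != target_id]
            cleaned += 1
    return cleaned


print(rename_embedded(embedded_docs, "malware--m", "RenamedRAT"))
print(delete_referenced(referenced_docs, "malware--m"))
```

With references, a rename is a single write to `malware--m`; the cost moves to deletion, where every referring document has to be found and cleaned.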

Graph Databases

Graph databases are useful to create edges between nodes and to query the graph formed by these relationships. They are designed to make graph traversal and shortest-path computation efficient.
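As a rough illustration of the kind of operation a graph engine is built around, the sketch below runs a breadth-first search for the shortest path between two objects over an in-memory adjacency map. The edges and identifiers are hypothetical, and a real graph database would do this with its own storage and indexes rather than a Python dictionary.

```python
from collections import deque

# Hypothetical relationships, treated as undirected edges.
edges = [
    ("indicator--b", "campaign--c"),
    ("campaign--c", "threat-actor--x"),
    ("indicator--a", "threat-actor--x"),
    ("threat-actor--x", "malware--m"),
]


def build_adjacency(edge_list):
    """Index edges by node so that neighbours are found without scanning."""
    adjacency = {}
    for source, target in edge_list:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the list of nodes, or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


adjacency = build_adjacency(edges)
print(shortest_path(adjacency, "indicator--b", "malware--m"))
```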

The issue with graph databases is that they are not designed to store documents or query them. The supported types for the attributes of the documents are often limited, and the query syntax allows only basic matching.

A common solution is to pair a document database with a graph database that handles the relationships. This combined solution works reasonably well, but it creates a lot of complexity. Each insertion or removal has to be performed in both databases, and consistency between the two has to be ensured at the software level.
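The consistency burden can be sketched with two stand-in stores. `DocumentStore`, `GraphStore` and `insert_relationship` are hypothetical names invented for this example; real clients would add network failures, retries and partial-failure cases that a simple compensation step like this one does not cover.

```python
class DocumentStore:
    """Stand-in for a document database."""

    def __init__(self):
        self.docs = {}

    def insert(self, doc):
        self.docs[doc["id"]] = doc

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)


class GraphStore:
    """Stand-in for a graph database; `fail` simulates an outage."""

    def __init__(self, fail=False):
        self.edges = set()
        self.fail = fail

    def add_edge(self, source, target):
        if self.fail:
            raise ConnectionError("graph store unavailable")
        self.edges.add((source, target))


def insert_relationship(documents, graph, relationship):
    """Write to both stores; undo the first write if the second one fails."""
    documents.insert(relationship)
    try:
        graph.add_edge(relationship["source_ref"], relationship["target_ref"])
    except Exception:
        documents.delete(relationship["id"])  # compensating action
        raise


relationship = {
    "id": "relationship--r1",
    "source_ref": "indicator--a",
    "target_ref": "threat-actor--x",
}

documents, graph = DocumentStore(), GraphStore()
insert_relationship(documents, graph, relationship)
print(sorted(documents.docs), sorted(graph.edges))
```

If the application crashes between the two writes, the compensating delete never runs, which is exactly the kind of inconsistency that has to be detected and repaired separately.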

Multi-model Databases

A multi-model database system is the ideal scenario, as it supports all three data models: key/value, documents and graphs, all at the same time.

To make this as convenient and efficient as possible, it makes sense to allow access to all data models through the same query language. That is currently not available in layered multi-model databases: some offer a series of different APIs, and others provide one query language for columnar data and another for graphs.

That’s where the native multi-model databases come in. The architecture of a native multi-model database allows developers to work against a single API and work with the very same dataset. The integrated data access layer has full knowledge about the queries at any given time and can therefore automatically optimize even combined queries efficiently.

Conclusion

ArangoDB allows SEKOIA’s security analysts to get all the context needed around an indicator in an efficient manner, improving their investigation workflow. The ArangoDB native multi-model database was the best fit for storing and identifying threat intelligence data.

With a native multi-model database, queries run fast, and we can access a different data model by simply changing a query. With this multi-model database engine, SEKOIA can store STIX data in JSON and perform complex queries without sacrificing performance.

Done right, the combination of document, key/value and graph is no compromise. Whether used as a document store or as a graph database, it can be as efficient as a specialized solution.
