How to Avoid Drowning in your Security ‘Data Lakes’

Security monitoring is a ubiquitous task in every enterprise that is trying not only to thwart malicious activity but also to understand and optimize legitimate traffic to its information systems. The problem is that these monitoring activities generate huge ‘data lakes’. The lakes contain extremely valuable raw data, but they mostly sit unused, both because large, complex and diverse data sets are difficult to work with and because of a lack of knowledge about what analytics truly means.

The good news is that today’s big data technology and state-of-the-art analytics can dramatically improve the ability to quickly and effectively mine this data, producing the reports, insights and visualizations needed to optimize an enterprise’s data security environment. The bad news is that most security organizations are enamored with data lakes for the wrong reasons – they have read about data lakes, so they must have one. So they spend a few years (and a few million dollars) building the lakes with very little thought about what they will do with the raw data, how it can improve their security posture, or even how they will measure the benefit they reap. They very often end up in the middle of a large data lake with no way of making it safely to shore.

However, avoiding this fate is not that hard – the industry has more than ten years’ worth of good and bad experience – which means you can learn from the best (and the worst). So skip the mumbo-jumbo two-pagers that regurgitate the usual herd-feed and do a little thinking of your own before you jump into the deep end. And yes, this article is a two-pager too – so use it only to get you thinking again.

Five DOs and DON’Ts

Your first step should not be to figure out how to dump all the data into one place – even if that place is now a Big Data platform. Dumping data, and even indexing it, is easy and cheap these days – but garbage in, garbage out. Instead, start by defining the analytics and the REQUIREMENTS of why you are doing this project (and please don’t define the requirements as “the ability to store all security events for three years in one place”). Build a solution for that ONE thing that has eluded you and your SIEM implementations. Then add another; then another; then generalize. The problem today is not the storage, the compute or the cost of Big Data – it’s the development and the productivity. So run projects that can show you how productive each option makes you.
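To make “that ONE thing” concrete, here is a minimal sketch in Python of what a single, narrowly scoped analytic might look like before any generalization. The event data, field names and threshold are entirely hypothetical – the point is that a well-defined question (“which accounts show repeated failed logins?”) needs only a few lines, not a filled lake:

```python
from collections import Counter

# Hypothetical parsed authentication events: (user, outcome).
# In practice these would come from your existing log pipeline.
events = [
    ("alice", "success"), ("bob", "failure"), ("bob", "failure"),
    ("bob", "failure"), ("bob", "failure"), ("bob", "failure"),
    ("carol", "success"), ("alice", "failure"),
]

THRESHOLD = 5  # assumed policy: 5 or more failures flags an account


def flag_accounts(events, threshold=THRESHOLD):
    """Return accounts whose failed-login count meets the threshold."""
    failures = Counter(user for user, outcome in events if outcome == "failure")
    return sorted(u for u, n in failures.items() if n >= threshold)


print(flag_accounts(events))  # → ['bob']
```

Once an analytic like this proves its worth, the next one is added, and only then is the plumbing generalized – the opposite of building the lake first.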

This may be uncomfortable at first, but the way you tackle hard problems these days is different. Machine learning techniques – clustering, decision trees and the like – work. They work really well, especially on large data sets and when history is available. In fact, security is a fabulous space for these techniques. The problem is that people are often more comfortable with what they can understand or what they already know, so they say, “show me first how the machine decided what it decided and then, when I understand the logic, I will accept it and off we go”. It doesn’t work that way. Someone can explain the general algorithm to you and show you a sample analysis – but don’t ask them to show you a precise decision path. Learn to trust the machine.
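As a toy illustration of detection driven by a learned baseline rather than hand-written rules, here is the simplest statistical analogue – a z-score outlier check over hypothetical per-host connection counts (the host names, counts and threshold are all invented for the sketch). Real ML techniques like clustering work on the same principle, just in many dimensions:

```python
import statistics

# Hypothetical daily outbound-connection counts observed historically.
baseline = [102, 98, 110, 95, 105, 99, 101, 97, 103, 100]

# Today's counts per host; db-01 behaves very differently.
today = {"web-01": 104, "web-02": 99, "db-01": 340}

mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)


def is_anomalous(value, z_threshold=3.0):
    """Flag values more than z_threshold standard deviations from the baseline."""
    return abs(value - mean) / stdev > z_threshold


anomalies = sorted(h for h, v in today.items() if is_anomalous(v))
print(anomalies)  # → ['db-01']
```

Note that nobody wrote a rule saying “340 connections is bad” – the threshold comes from history, which is exactly why demanding a human-readable decision path up front misses the point.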

The highest cost is the development, the functionality and the analytics – not the platforms. So look for technologies that make this easy – not technologies where everything needs an army of people. Many Big Data platforms are so complex and so fragmented that you will never manage to get anything done. Look for flexibility, agility, simplicity and rapid development – technologies like NoSQL 2.0 built for Big Data are an excellent example. Look at areas close to what you’re doing – such as analytics on Internet of Things (IoT) and Machine-to-Machine (M2M) data. Start with a concrete project and get to it quickly (rather than wait a year or two for the lake to fill).

It’s true there is safety in numbers, but a lot of what you are thinking about may be five years old and already abandoned by the leaders. For example, while everyone was rushing to do “classic Hadoop” (MapReduce, for example), Google had abandoned it more than eight years earlier. Skate to where the puck is going – not to where it has been.

Most companies are not Google, Facebook or Netflix. Don’t assume you have the resources, talent and drive that these companies, or some of the innovative startups, have. Your data sets are often not “Google size” and your requirements are not at that level. Choose based on your needs – not on others’ needs. Analyze your own needs and only then pick – or you will end up with something optimized for someone else.

If I had to pick one guiding principle, it’s this: focus on what you want to do first and define it in terms of functionality and outcomes (with the data lake helping you do it). Don’t set out to just build a large data lake; do think about how to pragmatically utilize this valuable data – without drowning.
