Is Your Data AI-Ready?

Rapid advancements in generative artificial intelligence (AI) are reshaping strategic agendas and impacting organizations around the globe.  

Although AI has been around for decades, tools such as ChatGPT, Google Bard, and Microsoft Copilot now produce remarkably human-like responses, unlocking diverse possibilities. Organizations need to recognize the transformative potential of AI and develop strategies to leverage these technologies for insights, innovation, and decision-making.

Addressing privacy and security concerns associated with AI implementation should be at the top of the agenda for CISOs and CIOs.

AI-powered algorithms and techniques can automate data processing, allowing for real-time analytics, pattern recognition, and predictive modeling.  This will revolutionize information management practices, uncover hidden insights, improve operational efficiencies, and drive strategic decision-making.  The effectiveness of AI depends on proper training and fine-tuning.

Generative AI creates its output by evaluating a vast collection of unlabeled, unstructured data. It responds to prompts with output that is statistically plausible given that dataset. Similar to training new employees, you should train and continuously fine-tune your private LLM with your corporate intelligence.

Many organizations focus on refining and fine-tuning the algorithms used to develop AI models. A better approach is to focus on the data rather than the model. This data-centric approach keeps the model and code constant while iteratively improving the data. The outcome of an AI solution is driven more by enhancing and enriching the training data than by tuning the model or the code.

Corporate Intelligence: Garbage In, Garbage Out

When training a private LLM, you will encounter several obstacles. You need to find relevant data that reflects the corpus of corporate intelligence, and you need to ensure that sensitive data, particularly PII and PHI, does not become part of the model. Data exists in applications, repositories, and endpoint devices, but you need to separate relevant data from ROT (redundant, obsolete, trivial) data. Too often, organizations do not know where data lives, how current it is, or who is responsible for creating or editing it. As a result, there may be too much data to train on, you might not capture all of your corporate intelligence, and there is no guarantee that the content you do collect is accurate or relevant.
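As an illustration, a simple screening pass can flag documents that contain obvious PII before they reach the fine-tuning corpus. The sketch below is a minimal example only: the regex patterns, the `corpus/` directory, and the .txt-only scope are assumptions, and a real deployment would rely on a dedicated PII/PHI detection tool rather than a handful of patterns.

```python
import re
from pathlib import Path

# Illustrative patterns only; real PII/PHI detection needs far broader
# coverage (names, addresses, medical record numbers, and so on).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}

def screen_document(path: Path) -> dict:
    """Return which PII patterns appear in a candidate training document."""
    text = path.read_text(errors="ignore")
    hits = {name: len(rx.findall(text)) for name, rx in PII_PATTERNS.items()}
    return {name: count for name, count in hits.items() if count}

def collect_training_candidates(root: str) -> list[Path]:
    """Keep only documents with no detected PII for the fine-tuning corpus."""
    candidates = []
    for path in Path(root).rglob("*.txt"):  # hypothetical corpus layout
        if not screen_document(path):
            candidates.append(path)
    return candidates

if __name__ == "__main__":
    clean = collect_training_candidates("corpus/")
    print(f"{len(clean)} documents passed the PII screen")
```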

Users generate too many copies of documents every day through routine actions like copying, pasting, downloading, uploading, attaching, checking out, and checking in. Users and systems also create many derivatives: files that differ from the original but remain substantially similar, such as a document saved to another format like PDF. These copies and derivatives are the root cause of the redundancy problem.
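Derivatives such as a PDF export contain different bytes from the original, so they cannot be matched byte-for-byte. One common way to flag them is to compare the extracted text with a similarity measure. The following is a minimal sketch using word shingles and Jaccard similarity; the shingle size and the 0.8 threshold are illustrative assumptions, and it presumes you have already extracted plain text from each file.

```python
def shingles(text: str, size: int = 5) -> set[tuple[str, ...]]:
    """Break text into overlapping word n-grams ("shingles")."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared shingles over total distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def likely_derivative(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    """Flag two documents as a probable original/derivative pair."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```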

How to Prepare Data for Generative AI

It is essential to identify file copies and derivatives at minimal cost so that redundant and obsolete data can be reduced or, ideally, eliminated. In file systems, a file's identifier is typically a combination of its name and location. This information is not permanent, as files change during use or move between locations. Copying a file creates a new, independent file, making identification challenging. Users and systems judge a file's identity by its name, location, and perhaps other metadata associated with it. Effective file identification therefore requires at least comparing the hash of each file, using AI tools for analysis, or relying on user discretion.
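As a concrete illustration of hash-based identification, the sketch below walks a directory tree, computes a SHA-256 digest for each file, and groups byte-identical copies. The `corpus/` path is an assumption, and derivatives with different bytes would still need similarity analysis or user review.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 digest of a file's contents in streaming fashion."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_exact_copies(root: str) -> dict[str, list[Path]]:
    """Group files by content hash; any group with >1 entry is a set of copies."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_exact_copies("corpus/").items():
        print(digest[:12], *paths, sep="\n  ")
```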

Content Virtualization overcomes this limitation of existing file systems by making files independent of their physical location. A virtualized file has a unique identifier and a version number, and you can identify it by these parameters regardless of its location, name, or other metadata. All of its copies can be treated as the same file. When users or systems update the content, every copy across storage locations, applications, and endpoints updates automatically.
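The essential idea is that identity travels with the content rather than with the path. The toy model below illustrates that idea only; it is not any vendor's actual implementation. It treats every physical copy that shares a content identifier and version as one logical file.

```python
from dataclasses import dataclass

# Toy illustration of content virtualization, not a real implementation:
# every physical copy carries the same stable content identifier and version,
# so copies in different locations resolve to one logical file.

@dataclass(frozen=True)
class ContentIdentity:
    content_id: str   # stable identifier, independent of name or location
    version: int      # incremented when the content itself changes

@dataclass
class PhysicalCopy:
    path: str                  # where this particular copy happens to live
    identity: ContentIdentity

def logical_files(copies: list[PhysicalCopy]) -> dict[ContentIdentity, list[str]]:
    """Collapse many physical copies down to their logical files."""
    merged: dict[ContentIdentity, list[str]] = {}
    for copy in copies:
        merged.setdefault(copy.identity, []).append(copy.path)
    return merged

if __name__ == "__main__":
    # Hypothetical example: the same document version stored in two places.
    doc_v3 = ContentIdentity("doc-8f2a", 3)
    copies = [
        PhysicalCopy("/shares/finance/q3-report.docx", doc_v3),
        PhysicalCopy("/users/alice/Downloads/q3-report (1).docx", doc_v3),
    ]
    for identity, paths in logical_files(copies).items():
        print(identity, "->", paths)
```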

By using Content Virtualization, you can trace the entire lifecycle of a file, including its origin, modifications, and access history. It helps users dramatically reduce redundant copies and allows you to eliminate obsolete or redundant data with confidence. By minimizing ROT data, your organization not only reduces its threat surface but also makes it easier to apply security policies to files consistently and gives you accurate, context-rich visibility into how content is used, which is critical for analytics.

Content Virtualization benefits any organization looking to train a private LLM by ensuring that only current, valuable data is used for training. This eliminates the garbage-in, garbage-out problem and helps drive growth with AI technologies.
