The End User Classification Problem

When cloud-based Enterprise Content Management (ECM) came into the mainstream in the mid-2000s, businesses began digitizing old analog processes and migrating content from local machines and on-premises repositories to the cloud. As the world of digital content grew, ECM providers deployed a variety of ways to sort and segment proprietary or sensitive content. For most, this meant relying on IT administrators to set and enforce policies and end users to properly tag and categorize the content they create and share.

That’s the way most organizations still do it today. In other words, the way we think about data classification hasn’t changed much since we were all walking around with flip phones in our pockets.

What we can see now is that end-user classification was never very good on its own. In 2007, IT consulting firm AIIM conducted a survey which revealed that classification was a big challenge in Microsoft SharePoint deployments, with only 22% of organizations providing users with any guidance on corporate classification policies. Almost a decade later, AIIM found that more than one-third of users still say inconsistent metadata and classification is the biggest issue for their organization.

It is clear that the problem isn’t getting any better. When enterprise cloud was in its infancy, it may have been easier to require everyone in an organization to label, categorize, or tag all content in a repository. But the explosion of unstructured data has quickly made it impossible to keep up. As early as 2013, IBM reported that humans were creating 2.5 quintillion bytes of data every day, the vast majority of which is generated by businesses.

The size and scope of unstructured enterprise content make the prospect of end-user classification untenable. A typical Egnyte scan covers 7,000,000 documents, spreadsheets, PDFs and image files. We find sensitive PII in 10 percent of those files (roughly 700,000), spread across hundreds, often thousands, of locations. We are officially beyond the point where it makes sense to rely on a network of humans (your employees) to effectively classify that much content and still be productive.

In the age of GDPR, CCPA, PCI and HIPAA, relying on end users to properly classify data isn’t just inefficient; it’s risky too. The definition of PII (personally identifiable information) under GDPR alone encompasses potentially hundreds of pieces of information. Add that to the alphabet soup of other data privacy regulations, separate requirements for data retention and deletion, legal holds, and corporate policies on proprietary content, and you’ve got a level of complexity that human brains just aren’t built to handle. Organizations must ask themselves whether end users can reasonably carry that burden, or risk paying hefty fines for overexposed PII.

At Egnyte, we believe in a different approach, one that shifts the heavy lifting to machines and leverages human judgement to oversee the process and make decisions. By scanning the content and matching it to preconfigured classification policies, you can get a 360-degree view of the sensitive data on the system, where it lives, and who has access to it. As new content is created, it is automatically scanned and classified.
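
To make the idea of matching content against preconfigured policies concrete, here is a minimal sketch in Python. It is not Egnyte's implementation: it swaps in simple regular expressions where a real product would use far more robust detection, and the policy names, patterns, and the in-memory "repository" of path-to-text pairs are all hypothetical, chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """A hypothetical classification policy: a label plus a pattern to look for."""
    name: str
    pattern: re.Pattern


# Assumed, deliberately simplistic policies. Real PII detection needs
# validation, context, and many more categories than these two.
POLICIES = [
    Policy("us_ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    Policy("email_address", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
]


def classify(text: str) -> set[str]:
    """Return the names of every policy whose pattern appears in the text."""
    return {p.name for p in POLICIES if p.pattern.search(text)}


def scan(repository: dict[str, str]) -> dict[str, set[str]]:
    """Classify each file and report only the locations holding sensitive data."""
    report = {}
    for path, text in repository.items():
        labels = classify(text)
        if labels:
            report[path] = labels
    return report
```

Calling `scan` on a mapping of file paths to extracted text returns the paths that matched at least one policy, along with the labels found there. That is the "where it lives" half of the picture; joining the result against permission data would be needed to answer "who has access to it."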

This approach doesn’t mark the end of all forms of tagging or manual classification. Employees still have a role to play in responsibly managing and storing their content. But with a little assist from machine learning, we can make the process a whole lot better so they can get back to work.
