By Winnie Cheng, Chief Data Scientist, Io-Tahoe LLC
The GDPR (General Data Protection Regulation) requirements taking effect this May can be overwhelming, and the cost of non-compliance can be astronomical. Managed Service Providers and Value-Added Resellers with established customer relationships are uniquely positioned to advise their customers and help them grapple with the complexities of this new regulation by bringing in the right software vendors.
The urgency and broad scope of the problem make it difficult to settle on an operational plan. Reaching consensus on data standardization and IT governance policies takes too long, and those policies don’t address the massive amounts of data already residing in existing data systems.
Combing through existing data systems with an army of Database Administrators (DBAs) to locate and classify sensitive data is both slow and expensive: it typically takes an experienced DBA five days to identify sensitive data in one database table. This becomes intractable for organizations that have thousands of databases with hundreds of tables in each, and that doesn’t even take into account the complex relationships between database systems that are difficult for humans to analyze. Machines, however, can comb through large amounts of data, spotting patterns and highlighting multi-level relationships, so why not harness this power?
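To put rough numbers on that, here is a back-of-the-envelope sketch. Only the five-days-per-table figure comes from the text above; the estate size (1,000 databases with 100 tables each) and the 250 working days per year are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope estimate of manual review effort.
# DATABASES, TABLES_PER_DATABASE and WORKING_DAYS_PER_YEAR are
# hypothetical values; the article says only "thousands" and "hundreds".
DATABASES = 1_000
TABLES_PER_DATABASE = 100
DBA_DAYS_PER_TABLE = 5          # figure quoted in the article
WORKING_DAYS_PER_YEAR = 250     # assumption


def manual_review_effort(databases, tables_per_database, days_per_table):
    """Return the total DBA-days needed to review every table by hand."""
    return databases * tables_per_database * days_per_table


total_days = manual_review_effort(DATABASES, TABLES_PER_DATABASE, DBA_DAYS_PER_TABLE)
dba_years = total_days / WORKING_DAYS_PER_YEAR

print(f"{total_days:,} DBA-days, or about {dba_years:,.0f} DBA-years")
# prints: 500,000 DBA-days, or about 2,000 DBA-years
```

Even at the low end of "thousands" and "hundreds," the manual approach adds up to thousands of person-years under these assumptions.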
Imagine you are a TSA officer looking to screen luggage for hazardous material. If all you are equipped with is a flashlight, it will be extremely difficult to look for dangerous chemicals and weapons. Being able to comply with GDPR is similar; personal data is often hidden. The data is not explicitly labeled in enterprise systems, just as luggage with concealed weapons does not have a traveler’s tag providing this information.
For example, how do you know whether ‘1402338420’ is a bank account number or something less sensitive?
Machine learning-driven smart data discovery solutions go beyond traditional approaches that look only at metadata (e.g. database schema definitions) and apply advanced algorithms to the data itself. This is akin to the scanner at the TSA, which detects chemicals and object patterns that wouldn’t otherwise have been identified with a flashlight.
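The difference between inspecting metadata and inspecting the data itself can be illustrated with a minimal sketch. Everything in it is hypothetical: the column name, the hint list, the sample values, and the 0.8 threshold are made up, and a simple regular expression stands in for the learned models a real discovery product would use.

```python
import re

# Hypothetical keyword list for a metadata-only (schema-based) check.
SENSITIVE_NAME_HINTS = ("email", "ssn", "account", "phone")

# A deliberately simple email pattern; a stand-in for a learned model.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def flagged_by_metadata(column_name):
    """Flag a column only if its name hints at sensitive content."""
    name = column_name.lower()
    return any(hint in name for hint in SENSITIVE_NAME_HINTS)


def flagged_by_content(values, threshold=0.8):
    """Flag a column if enough of its sampled values look like emails."""
    if not values:
        return False
    matches = sum(1 for value in values if EMAIL_PATTERN.match(value))
    return matches / len(values) >= threshold


# An uninformative column name hiding personal data (made-up sample).
column_name = "fld_17"
sample_values = [
    "ana@example.com",
    "li.wei@example.org",
    "n/a",
    "sam@example.net",
    "joe@example.com",
]

print(flagged_by_metadata(column_name))   # the name gives nothing away
print(flagged_by_content(sample_values))  # the values do
```

The schema-based check passes over `fld_17` because its name carries no hint, while the content-based check flags it because four of the five sampled values match the pattern.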
Back to the example of a piece of data with the value ‘1402338420’: repositories of patterns learned from past sensitive data can help indicate whether this value is likely a bank account number. The machine learning algorithms behind such a solution get smarter as they are exposed to more, and newer, sensitive data patterns over time. They also take feedback from users on their ‘guesses’ and use it to get even better.
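As a toy illustration of a pattern repository with a feedback loop, the sketch below reduces each value to a "shape" and counts which labels have been seen for that shape. This simple frequency count is an assumption chosen for readability, not a description of how any particular product works, and every value other than ‘1402338420’ is invented.

```python
from collections import Counter, defaultdict


def shape(value):
    """Generalize a value: digits become '9', letters 'A', the rest is kept."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in value
    )


class PatternRepository:
    """Hypothetical store of label counts per value shape."""

    def __init__(self):
        self._counts = defaultdict(Counter)  # shape -> Counter of labels

    def learn(self, value, label):
        """Record a labeled example, e.g. from history or user feedback."""
        self._counts[shape(value)][label] += 1

    def guess(self, value):
        """Return (label, confidence), or (None, 0.0) for an unseen shape."""
        labels = self._counts.get(shape(value))
        if not labels:
            return None, 0.0
        label, count = labels.most_common(1)[0]
        return label, count / sum(labels.values())


repo = PatternRepository()
# Made-up historical examples of ten-digit values.
repo.learn("2093348811", "bank_account")
repo.learn("7781200456", "bank_account")
repo.learn("5550142233", "phone_number")

label, confidence = repo.guess("1402338420")
print(label, round(confidence, 2))

# A user confirms the guess; the feedback becomes one more training example.
repo.learn("1402338420", "bank_account")
label_after, confidence_after = repo.guess("1402338420")
print(label_after, round(confidence_after, 2))
```

With these made-up examples, the guess for ‘1402338420’ starts at two out of three in favor of a bank account number and rises to three out of four once the user confirms it. Shape alone cannot separate a ten-digit account number from a ten-digit phone number, which is one reason context and relationships matter.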
The machine learning algorithm also understands context and relationships, much like advertising systems that suggest travel destinations when booking flight tickets online. These fundamental machine learning ideas from different industries are extended and applied in new and innovative ways for data discovery. The results are powerful and actionable insights that can guide your GDPR compliance team to come up with operational plans to meet the new regulatory requirements.
About The Author
Dr. Winnie Cheng currently serves as Chief Data Scientist for Io-Tahoe. Her expertise is in areas such as large-scale distributed systems, machine learning, and artificial intelligence. Dr. Cheng was co-founder and CTO of Flowcast, a fintech machine learning company, and held Chief and Senior Data Scientist positions at Bankrate, Inc., J.P. Morgan, American Express, and IBM, as well as senior engineering roles at Microsoft and Hewlett-Packard. Dr. Cheng serves as an advisor to Hatcher+, Fundnel Ltd., and ADSKOM, and as a mentor to Shoo-in Career. Dr. Cheng holds a PhD in Computer Science and Artificial Intelligence from the Massachusetts Institute of Technology (MIT) and a Master of Science (MS) in Engineering from Stanford University.