How to prepare your data to get the best results from AI

Artificial intelligence (AI) is a powerful tool that can help your business automate data processing and derive value. However, the quality of the results produced by AI algorithms is only as good as the quality of the data fed into them. In other words, garbage in equals garbage out. So, how can we ensure that our AI algorithms are fed with good-quality data to produce good-quality outputs?

Understanding the foundations of AI

The first step in preparing data for AI is to understand what AI is and what it's not. AI refers to a range of technologies that enable machines to perform tasks that would normally require human intelligence, such as understanding natural language, recognizing images, and making predictions. AI is not magic, and it's not a silver bullet that can solve all our problems. AI systems do not "think," nor can they reason; they turn all input into numbers and do incredibly complex math at amazing speeds. AI systems are tools that can help us achieve our goals, but only if we use them correctly.

There are two main types of AI. The first, which can be viewed as old-fashioned AI, uses machine learning (ML) to create and train models based on samples of good and bad data; it powers Intelligent Document Processing (IDP) tools like SharePoint Premium. The second covers the "new" AI capabilities, such as generative AI, which use neural networks, transformers, and other advanced techniques to understand natural language and generate new content. Both types of AI can be useful, but they require different approaches to data preparation.

Curate and prepare your data and information

To ensure that our AI algorithms produce good-quality results, we need to curate and prepare our data and information. If we want our generative AI assistant tools, like Microsoft Copilot, to provide good answers to our questions, they have to be able to find the right nuggets of knowledge within our vast corporate stores of data, information, and knowledge. We must help the AI algorithms by providing as many signals or clues as possible about what a given item of data or information is and what its context is.

This starts with cleaning up, which means developing policies to guide what is regarded as ROT — redundant, obsolete, and trivial data. To comply with regulatory requirements, it is important to have well-defined record management policies. This includes identifying which of the ROT data needs to be kept for longer periods of time and making sure that data is properly labeled and tagged.
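As a rough illustration, an age-based ROT sweep that respects records-management labels might look like the following sketch. The document fields, labels, and three-year retention window are illustrative assumptions, not a real IDP or records-management API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical document model: name, last-modified date, and an optional
# records-management label (e.g. "retain-7y") assigned by policy.
@dataclass
class Document:
    name: str
    last_modified: datetime
    record_label: str = ""

def is_rot_candidate(doc: Document, now: datetime,
                     max_age: timedelta = timedelta(days=3 * 365)) -> bool:
    """A document is a ROT candidate if it is older than the policy window
    and no retention label requires keeping it."""
    if doc.record_label:  # a retention label overrides age-based cleanup
        return False
    return now - doc.last_modified > max_age

now = datetime(2024, 1, 1)
docs = [
    Document("old_draft.docx", datetime(2019, 5, 1)),
    Document("contract.pdf", datetime(2018, 2, 1), record_label="retain-7y"),
    Document("q4_report.xlsx", datetime(2023, 11, 1)),
]
rot = [d.name for d in docs if is_rot_candidate(d, now)]  # ["old_draft.docx"]
```

Note that the labeled contract is excluded even though it is older than the window: this is the "identify which ROT data needs to be kept longer" step expressed as code.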


Organizing and tagging data

The next step in making data ready for AI is organizing it into well-structured workspaces, folders, and libraries. This provides AI systems with clues as to what the data is and how groups or sets of information are related.

Tagging our data with lots of good-quality metadata provides more clues to the types of information, how we have categorized it, and whether we have given it a security classification or noted it as containing privacy-related content like Personally Identifiable Information (PII). Metadata can be simple, single values in fields like Subject, or it can be highly complex with conditional fields, using hierarchical taxonomies or ontologies (fields describing relationships between files). All metadata is helpful to generative AI systems when answering our questions, and ML-based AI can provide great value by automatically creating metadata according to the contents of a file. This means that our people don't have to undertake the arduous and error-prone task of manually creating all this valuable metadata.
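To make this concrete, here is a minimal sketch of automatic tagging, using simple regular expressions and keyword matching as stand-ins for the ML models a real IDP tool would apply. The patterns and category names are illustrative assumptions:

```python
import re

# Stand-in PII detectors: a US SSN-style number and an email address.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
]
# Stand-in content categories keyed by keyword.
CATEGORIES = {"invoice": "Finance", "salary": "HR", "contract": "Legal"}

def auto_tag(text: str) -> dict:
    """Return metadata for a document: a PII flag and a content category."""
    metadata = {"contains_pii": any(p.search(text) for p in PII_PATTERNS)}
    metadata["category"] = next(
        (label for kw, label in CATEGORIES.items() if kw in text.lower()),
        "General",
    )
    return metadata

tags = auto_tag("Salary review for jane.doe@example.com")
# tags: {"contains_pii": True, "category": "HR"}
```

A production system would use trained classifiers and entity extractors rather than regexes, but the output is the same kind of signal: structured metadata attached to content so AI systems can find and classify it.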

Standardization and training

Standardizing our data formats and controlling versions of our data are other elements of good data and information management that can help AI systems. We can also promote good practices, such as education, training, adoption, and change management efforts, to assist our people in working with data and information and ensure it is used effectively.

Keeping security top of mind

In addition, we need to secure our data appropriately, depending on the use case. This usually means locking down sensitive data to prevent unauthorized access, and to stop people from uncovering data they should not see simply by asking an AI the right questions. While this is certainly important, and we absolutely must understand what is available to whom and which systems use AI tools, there are good use cases that may require opening up other data to AI algorithms. For example, while we might not want someone in the widget production division to access files that contain employee PII, it might be highly useful for suitably vetted HR division employees to have an AI assistant that can access such information.
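As a sketch of this principle, the retrieval layer of an AI assistant can filter candidate documents against the caller's role before anything reaches the model. The roles and scopes below are hypothetical:

```python
# Hypothetical role-to-scope mapping: HR can see HR records plus general
# content; production staff see only general content.
ROLE_SCOPES = {
    "hr": {"hr-records", "general"},
    "production": {"general"},
}

def retrievable(docs: list[dict], role: str) -> list[dict]:
    """Return only the documents whose scope the given role may access."""
    allowed = ROLE_SCOPES.get(role, set())  # unknown roles get nothing
    return [d for d in docs if d["scope"] in allowed]

docs = [
    {"name": "employee_pii.xlsx", "scope": "hr-records"},
    {"name": "widget_specs.pdf", "scope": "general"},
]
# A production-division user never sees the PII file; a vetted HR user does.
```

The key design point is that filtering happens before retrieval results are handed to the AI, so the assistant cannot leak a document the user was never entitled to read.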

Good quality data in, good quality data out

By following the steps above (cleaning up ROT, putting good data and information management practices in place, and adding lots of good metadata), we can ensure that our generative AI assistants are fed with good-quality data, which will help us achieve better results.

In conclusion, the practicalities of AI involve preparing our data to give the algorithms a better chance. By curating and preparing our data, securing it appropriately, and promoting good practices, we can ensure that our AI algorithms produce good-quality outputs. Remember, what you get out of AI is only as good as what you put in.

Microsoft 365 Copilot Readiness for Success
Jed Cawthorne
