Data: the good, the bad and the ugly

Authors Patrick White, Sarah Jarvis

Data is one of the most abundant resources on the planet and at the core of AI. Companies are starting to realize the importance of making sure that their data is good - but what does 'good' really mean and how do you get there?

“A very important thing to realize is: you aren’t just a grocery company, or a finance company, or an insurance company. You are also a data company.”

Sarah Jarvis is our Director of Applied Machine Learning and Data Science. Just like the surveyor who inspects a property you’re looking to buy, she and her team stress-test the data of companies who want to work with us.

“There is no such thing as perfect data. I rarely use absolutes, but the notion of perfect data simply doesn't exist. There is always noise in there.”

The growth of data science

The term ‘data science’ was first coined in the early 70s - but it’s only become a household name and profession since the early 00s. The data science community grows year-on-year to meet the demand from companies who increasingly seek to deploy data-led strategies.

“The role of data science is generally: you have data, you weren’t necessarily there when it was created, and you want to infer what’s happening. With AI we then take it a step further, working with machine learning engineers to best apply that data to our models.”

“I rarely use absolutes, but the notion of perfect data simply doesn't exist.”

Data is something nearly every company now owns and has to take care of, and it’s become a vital resource to help forecast, predict and better serve customers. Low-quality data isn’t necessarily going to stifle your AI projects, but it depends on what kind of AI you want to implement.

“We use Gaussian processes (a type of probabilistic modelling) in the Secondmind Decision Engine, which allows us to work with challenging and sparse data. Many businesses don’t have the oceans of data that some do, and we provide a realistic and practical AI approach for them. But the data still needs to be of an acceptable standard.

“There are three key things we’ll often assess to evaluate the quality of data we’ll be working with. These checks help us map out a customer AI project, but they also work as a valuable first step for any business that wants to assess their data quality themselves.”

1. Be internally consistent

This is a basic one, but arguably one of the most critical. Simply being consistent with the way you input and store your data is going to have a massive effect on its quality.

“Let’s use a phonebook as an example. For this to be a good source of data, you want the names, numbers and locations to all be in the right columns. You also want the ordering of first and last names, and the format of area codes, to be consistent throughout.

“We’ve seen duplicated data, data fields - like names and addresses - being used for different things, messy phone numbers and locations. This can really play havoc with any modelling you want to do.

“It might not seem like a big deal right now, having the odd entry incorrect or field name mishap, but these small errors all add up and may hold you back when you want to use your data in the future.”
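Sarah’s phonebook example translates directly into code. Here is a minimal Python sketch of the kind of consistency check she describes - the records, field names and expected phone format are purely illustrative:

```python
import re

# Hypothetical phonebook records; the field names are illustrative only.
records = [
    {"first": "Ada", "last": "Lovelace", "phone": "+44 20 7946 0001", "city": "London"},
    {"first": "Ada", "last": "Lovelace", "phone": "+44 20 7946 0001", "city": "London"},  # duplicate
    {"first": "Grace", "last": "Hopper", "phone": "020-7946-0002", "city": "London"},     # inconsistent format
]

# One agreed format, e.g. "+44 20 7946 0001" - any deviation gets flagged.
PHONE_PATTERN = re.compile(r"^\+\d{1,3}(?: \d+)+$")

def consistency_report(rows):
    """Flag exact duplicates and phone numbers that break the expected format."""
    seen, duplicates, bad_phones = set(), [], []
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates.append(i)
        seen.add(key)
        if not PHONE_PATTERN.match(row["phone"]):
            bad_phones.append(i)
    return {"duplicates": duplicates, "bad_phones": bad_phones}

print(consistency_report(records))  # → {'duplicates': [1], 'bad_phones': [2]}
```

Running a report like this regularly - rather than once, at the start of a project - is what keeps those small errors from adding up.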

2. Adopt standards

Some industries adopt data standards to help in their quest for good data.

“Data standards are more commonplace in certain sectors - like finance and medicine. The experience is often like going into a kitchen where the jars are labelled with the expiry date and ingredients. You have a much better idea of what you’re dealing with.”

There are plenty of great resources if you’re interested in learning more about these standards (ISO 8601 is Sarah’s personal favorite):

ONS data standards

USGS data standards

There is a common reason why more companies don’t adopt these:

“Projects usually start small, so people tend to think ‘why would I worry about standards now?’. But these projects grow - sometimes they grow very fast - and you’ll end up wishing you adopted some of these earlier in the process because it’s much harder to fix things further down the line.”

3. Hold onto the right data

Is it so bad to not keep hold of every bit of data? Probably not. But it’s important to consider what data will help you in the future.

“Lots of people don't keep certain data because they don't feel it's important. We’ve worked on projects where the company would often overwrite data with no record or annotation, and would never make backups. But this data can be critical when you start an AI project to improve decision-making, regardless of the business level you’re looking at.

“Historically, it made sense to not hold on to a lot of your data. A decade ago digital storage was really expensive and it was very costly to keep hold of every single piece. But since prices have plummeted in recent years, it’s become far cheaper to save data.

“But how do you know what to keep and what not to keep? You'll never get it perfectly right. It's a difficult call that you have to make based on your instinct and judgement. Typically the people that work with that data every day should be your first port of call - they often have insights that you wouldn’t have considered.”
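The overwriting problem Sarah describes has a simple remedy: store each revision with a timestamp instead of replacing the old value. A minimal, illustrative sketch of that append-only pattern (the field names here are hypothetical):

```python
from datetime import datetime, timezone

# Append-only store: updates never destroy history, so past values
# remain available for later analysis or model training.
history = []

def record(field, value):
    """Append a new revision instead of overwriting the old one."""
    history.append({
        "field": field,
        "value": value,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })

def latest(field):
    """Most recent value; the full trail stays in `history`."""
    return next(e["value"] for e in reversed(history) if e["field"] == field)

record("unit_price", 9.99)
record("unit_price", 10.49)   # a correction - the old value is still kept
print(latest("unit_price"))   # → 10.49
print(len(history))           # → 2
```

The same idea scales up as event sourcing or versioned tables in a database; the point is that a correction becomes a new record, not a lost one.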

So, how’s your data looking?

It’s tough. Implementing a good data strategy won’t always benefit you in the short term, but it will certainly pay off in the long run.

“It’s easy to see why having great quality data isn’t a priority for some companies - it can take a lot of time, energy and money. But adopting best practices now will save you a lot of headaches in the future. When the moment arrives where you want to implement advanced technology like AI or machine learning, you’ll be in a much better position and get better results - and even if you aren’t planning on adopting AI anytime soon, it will greatly improve your business decision-making.”

©2022 Secondmind Ltd.