Seven things you need to know when collecting data from the real world

May 11, 2021 / 5 min read

Lately, I’ve been running some projects where the data simply didn’t exist when the project started. I sort of like this. I know I shouldn’t, because it makes the project A LOT more complicated than starting with a huge, available and preferably pre-processed data set. But collecting data from the real world is a cross-functional challenge that calls for a lot of “general engineering” skills. And I’m an engineer.

And it’s fun! Not only do you get to run AI projects, you also get to deep-dive into construction machinery, production lines, product development and a million other interesting domain-specific topics.

Of course, as a domain expert reading this you might be very familiar with all this, and less familiar with the world of data and AI. Hopefully, you’re about to start some AI exploration and have an idea of how that can improve your process.

Why do you need to collect data?

If you just want to predict a future value based on transaction data you already have (like future order stock value), then you probably don’t. Just use some classic ML algorithms on the data you have and go for it.

But if you’re chasing the AI super skill that will change the way you do things and really help your experts, then you most likely need more data. Data you didn’t have at all, because the process might have been 100% non-digital before.

The 7 things you need to think about when collecting data

There are obviously more than seven things to think about when collecting data in the real world, but looking back at some of the projects I’ve done lately, these are the ones I think made the biggest impact.

1. The data is real

This means that the data comes from a real process, i.e., a real machine. You can basically touch it and smell it. It’s a process that’s impossible to capture ALL data from, but once you think through which aspects are important, it’s doable!

Example: It’s impossible to capture ALL data from operating a chainsaw (which could be what the “high level” idea was in the beginning), but when you think about it, the interesting data is probably vibrations, sound, power usage and the speed of the chain. These can all be collected.

2. The real world contains infinite variations 

The data comes from a world of infinite variations. When trying new things out or doing a proof of concept, you probably can’t collect a million samples to start with, since that’s pretty time-consuming. Instead, you probably want to start out with a dataset that you can collect in a couple of days or weeks. The first step here is to get to know the data in order to get a feeling for what could be done.

To do this, you have to reduce the variations, so that you can focus your efforts on only one varying parameter in the data.

Example: To get to know how the noise the chainsaw makes varies with the material you are cutting through, the vibrations, power usage and speed of the chain need to be kept as fixed conditions, so that only one parameter changes in your data set.
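A quick way to check this on the data you actually recorded is to compare the condition labels noted for each run. Below is a minimal sketch, assuming each run is logged as a small dictionary of conditions; the function name `varying_parameters` and the keys `material`, `chain_speed` and `chain_model` are hypothetical and only there for illustration.

```python
def varying_parameters(runs):
    """Return the sorted names of conditions that take more than one value across runs."""
    keys = set()
    for run in runs:
        keys.update(run)
    varying = []
    for key in sorted(keys):
        # A condition missing from a run shows up as None and counts as a variation.
        values = {run.get(key) for run in runs}
        if len(values) > 1:
            varying.append(key)
    return varying


# Hypothetical log of three chainsaw runs: only the material should change.
runs = [
    {"material": "pine", "chain_speed": "full", "chain_model": "A"},
    {"material": "oak", "chain_speed": "full", "chain_model": "A"},
    {"material": "birch", "chain_speed": "full", "chain_model": "A"},
]

changed = varying_parameters(runs)
if len(changed) != 1:
    print("Expected exactly one varying parameter, found:", changed)
```

If `changed` ever contains more than one name, you know before any modelling starts that the variations in the data cannot be attributed to a single cause.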

3. Educate and include everybody involved 

Using skilled machine operators to collect the data is essential, but don’t forget that data science is nowhere near intuitive for machine operators (or anybody else outside the data world). They know their machines like the back of their hands. But what is good and bad for data collection is something completely different. Put in some extra, extra time to make sure everybody involved understands the basic concepts of the “why, what and how” of your desired data set. And never, never “assume” anything.

Also, be there and be present when collecting the data. This gives you the benefit of actually understanding the process you want to improve.

Example: To the operator, switching to another make/model of chain in the middle of the data collection (because the old one broke) makes no difference, since the two chains might perform the same. But they could make different noises (even if the difference is not noticeable to the human ear). In that case, it becomes difficult to know whether the variations come from different materials or from different chains.

4. Set up a data collection plan

Think of this as the operating schedule for a big conference. With the add-on that it should be repeatable and done every time with a completely different crew of people. That’s your data collection plan. 

It is also important to add notes, tips and tricks to the plan while doing the data collection, to make sure you don’t fall into the same trap again. The plan should at least cover the where, what, when and how of collecting data (this could be an entire post of its own).

Here, it is also very important to include the varying/fixed conditions. If you think of it as repeatable and totally operator agnostic you’re on the right path. Also, make sure the measurements are done by skilled people.
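To make this a bit more concrete, here is a minimal sketch of how such a plan could be written down in a structured way. It is only one possible layout: the class `DataCollectionPlan`, its field names and the example values are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataCollectionPlan:
    """Hypothetical plan layout: where/what/when/how plus fixed and varying conditions."""
    where: str
    what: str
    when: str
    how: str
    fixed_conditions: dict
    varying_parameter: str
    varying_values: list
    notes: list = field(default_factory=list)

    def run_sheet(self):
        """One entry per run: the fixed conditions plus one value of the varying parameter."""
        return [
            {**self.fixed_conditions, self.varying_parameter: value}
            for value in self.varying_values
        ]


plan = DataCollectionPlan(
    where="workshop, bench 2",
    what="chainsaw sound while cutting",
    when="day 1, morning",
    how="one microphone at a fixed distance, one cut per run",
    fixed_conditions={"chain_model": "A", "chain_speed": "full"},
    varying_parameter="material",
    varying_values=["pine", "oak", "birch"],
)

# Notes, tips and tricks get added while collecting, so the next crew avoids the same trap.
plan.notes.append("Spare chains must be the same make/model as the original.")

sheet = plan.run_sheet()
for number, run in enumerate(sheet, start=1):
    print(number, run)
```

Keeping the fixed and varying conditions as explicit fields means a completely different crew can print the same run sheet and repeat the collection without having to guess what was held constant.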

5. Make sure the data is usable

Think about data synchronization, labeling, formatting, etc. before the data collection, not after. The data collection process is cumbersome and you don’t want to come back the next day just to say, “you know, we forgot something”.

Example: Sometimes synchronizing between different sensors is as simple as using the traditional “movie clap” to sync between audio and video, or banging with a hammer on the chassis to sync between audio and vibrations.
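The clap or hammer trick works because the hit shows up as a sharp peak in every recording. A minimal sketch of how the offset could be estimated afterwards is shown below, assuming the hit is the loudest event in each recording and that each signal is simply a sequence of samples with a known sample rate. The function names, sample rates and synthetic signals are assumptions for illustration; noisy real recordings may need something more robust, such as cross-correlation.

```python
def impulse_time(signal, sample_rate_hz):
    """Return the time in seconds of the sample with the largest absolute amplitude."""
    peak_index = max(range(len(signal)), key=lambda i: abs(signal[i]))
    return peak_index / sample_rate_hz


def sync_offset(signal_a, rate_a_hz, signal_b, rate_b_hz):
    """Seconds to add to the timestamps of signal B to line them up with signal A."""
    return impulse_time(signal_a, rate_a_hz) - impulse_time(signal_b, rate_b_hz)


# Synthetic demo: 3 seconds of low-level background with one hammer hit in each recording.
audio_rate_hz = 8000
vibration_rate_hz = 1000
audio = [0.01] * (3 * audio_rate_hz)
vibration = [0.01] * (3 * vibration_rate_hz)
audio[int(1.25 * audio_rate_hz)] = 1.0          # hit appears 1.25 s into the audio
vibration[int(0.75 * vibration_rate_hz)] = 1.0  # same hit appears 0.75 s into the vibration log

offset_s = sync_offset(audio, audio_rate_hz, vibration, vibration_rate_hz)
print("Shift the vibration timestamps by", offset_s, "seconds")
```

Working in seconds rather than sample indices is a deliberate choice here, since the sensors usually run at different sample rates.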

The best case, of course, is when everything is connected to the same recording device, but as it seems, early POC work is never “best case”.

6. Accept that it’s a learning process

Don’t schedule only the 1st data collection; schedule the 2nd run right away. It’s almost certain that you will have forgotten something after the 1st, or that you just learned something that makes you want to do it all again. It’s natural, so plan for it.

7. Over-equip with sensors

Adding another camera, microphone, thermometer or accelerometer is often easier than having to do it all over again. Hardware is also often not that expensive compared to people’s time.

Finally - visualize the process

Congratulations, you’re all set. Almost. The next step and final recommendation is to actually visualize the entire process by doing a walk-through/dry run of every step with both the operators and the data scientists, to make sure you have a joint understanding of what’s going to happen.

Good luck!

  • Björn Treje

    Head of Technical Enablement

    Björn has a Master of Science in Electrical Engineering. He strives to put engineering into the business and business into the engineer. Secretly he hopes all projects involve helmets or reflective vests at some point.
