The most important yet unanswerable question in AI
The uncomfortable truth about how much data you need to use AI
Photo by Joshua Sortino on Unsplash
Data plays a central role in the development of AI and ML models. These tools are trained on data; without data, there is no AI model. You can see model training as a way to extract value from data; this suggests that, in many situations, more data brings more value, that is, better performance of our AI and ML algorithms.
However, collecting data, especially individual data points with specific label information, can be a challenge in many non-digital sectors like manufacturing and life sciences.
While e-commerce and digital marketing can easily collect millions of data points from users browsing the internet, things are very different when behind a data point there is a real-life object or sample. For instance, in molecular biology, omics technologies can collect many variables, but the sample size is often limited by the effort required to set up the experiments. In medicine, many studies involve only a handful of patients because of the challenges of patient recruitment.
Manufacturing data comes with its own set of challenges: a new product might have no data at all, while existing products might suffer from rare anomalies, missing label information, or unmeasured crucial variables (e.g. which consumable substances are used in each shift, whether operators change, manual interventions, etc.).
Given these challenges, before embarking on an AI project it is very sensible to estimate how much data needs to be collected before an AI model can perform reasonably well. In domains where data is scarce and costly to gather, estimating the necessary sample size is crucial for budgeting.
However, the question “How much data do we need for our AI project?” cannot be answered in advance. A common rule of thumb suggests collecting at least 10 times as many samples as variables. But this is merely a guideline, and the number of variables itself may be wrong: the collected variables could omit very important ones and include irrelevant ones. Since data science is about answering questions from data, we face a chicken-and-egg dilemma: we need data to determine whether we have enough data for our goals.
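As a quick illustration, the rule of thumb is a one-line computation. The function name and the configurable factor in this sketch are our own illustrative choices, not a standard API:

```python
def rule_of_thumb_sample_size(n_variables: int, factor: int = 10) -> int:
    """Samples suggested by the '10 times the number of variables'
    rule of thumb. A rough guideline only, not a performance guarantee."""
    return factor * n_variables

# Example: a tabular dataset with 25 candidate variables
print(rule_of_thumb_sample_size(25))  # -> 250
```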
If we do not know how much data we need for our AI project, how can we proceed? We recommend a data-centric approach, where AI performance improvements are obtained not primarily by changing the model but by changing the data; in our case, that means collecting data iteratively (a simplified sketch of this loop follows the list):
- Based on previous data collections, and to the best of our knowledge, determine all informative variables we can measure in the upcoming data collection
- Collect new data; the number of samples depends on budget and goals
- Process both old and new data to create a dataset that best represents real-world scenarios
- On this newly-derived dataset, retrain an existing AI model or train a new AI model
- Deploy the trained AI model in production
- Regularly measure the performance of the AI model in production using various metrics; if the measurements are unsatisfactory or start degrading, go back to point 1.
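To make the loop concrete, here is a deliberately simplified, runnable Python sketch of the iteration. Every function, threshold, and number in it is an illustrative stand-in (the real steps involve people, instruments, and deployment pipelines), not an actual implementation:

```python
import random

# Hypothetical stubs standing in for project-specific work.
def collect_batch(n_samples):
    """Step 2: collect new labelled samples (simulated here)."""
    return [(random.random(), random.random()) for _ in range(n_samples)]

def train_model(dataset):
    """Step 4: (re)train a model on the current dataset (simulated)."""
    return {"n_training_samples": len(dataset)}

def production_score(model):
    """Step 6: monitor the deployed model (simulated metric that
    improves as the training set grows)."""
    return min(1.0, 0.5 + model["n_training_samples"] / 1000)

TARGET_SCORE = 0.9  # illustrative, project-specific threshold
dataset = []
score = 0.0

# Iterate: collect, retrain, deploy, monitor, repeat while unsatisfied.
while score < TARGET_SCORE:
    dataset += collect_batch(n_samples=100)  # steps 1-3, simplified
    model = train_model(dataset)             # step 4
    # step 5 (deployment) is a no-op in this simulation
    score = production_score(model)          # step 6
    print(f"samples={len(dataset)} production score={score:.2f}")
```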
This iterative approach offers several benefits. As step 2 of the iteration suggests, by collecting data step by step you can adjust and guide the collection process so that your dataset's distribution closely resembles the real-world distribution: you can collect data for cases you are missing, or realize you should collect additional variables. For example, in a medical setting you may realize that your patient demographics underrepresent some categories, or in a manufacturing anomaly detection project that you are gathering data at the times when anomalies are least likely.
But there is an additional reason in favor of this iterative approach: counteracting so-called data drift. In many scenarios, older data unfortunately becomes less and less representative of the present situation, so the AI model must periodically be retrained on relatively recent data only. By collecting data incrementally rather than in one go, we can better counteract the effects of data drift: we can still make use of some of the data already collected. If all data were collected at once, data drift would hit sooner and harder: the whole training set, not just the oldest slices as in the iterative approach, would become obsolete.
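One way to make the “go back to point 1” trigger concrete is a statistical drift check on the input features. The following is a minimal sketch, assuming a single numeric feature and using a two-sample Kolmogorov-Smirnov test from SciPy; the data and the significance threshold are simulated for illustration, not the author's prescribed method:

```python
import numpy as np
from scipy.stats import ks_2samp

# Compare a feature's distribution in the (older) training data
# against recent production data.
rng = np.random.default_rng(42)
training_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
recent_feature = rng.normal(loc=0.4, scale=1.0, size=300)  # shifted mean

statistic, p_value = ks_2samp(training_feature, recent_feature)
if p_value < 0.01:  # illustrative significance threshold
    print(f"Drift suspected (KS statistic={statistic:.3f}, p={p_value:.4f})")
    # -> time to collect fresh data and retrain, as in the loop above
else:
    print("No strong evidence of drift in this feature")
```

In practice one would run such a check per feature (and on the model's output distribution) on a schedule, alongside the performance metrics from step 6.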
In conclusion, AI needs data, but it is challenging to estimate how much data is needed. To save on budget, and to maintain good performance, it is advisable to collect new data regularly and monitor the performance of the AI model.
Take the Free Data Maturity Quiz
In the world of data science, understanding where you stand is the first step towards growth. Are you curious about how data-savvy your company truly is? Do you want to identify areas of improvement and gauge your organization’s data maturity level? If so, I have just the tool for you.
Introducing the Data Maturity Quiz:
- Quick and Easy: With just 14 questions, you can complete the quiz in less than 9 minutes.
- Comprehensive Assessment: Get a holistic view of your company’s data maturity. Understand the strengths and areas that need attention.
- Detailed Insights: Receive a free score for each of the four essential data maturity elements. This will provide a clear picture of where your organization excels and where there’s room for growth.
Taking the leap towards becoming a truly data-driven organization requires introspection. It’s about understanding your current capabilities, recognizing areas of improvement, and then charting a path forward. This quiz is designed to provide you with those insights.
Ready to embark on this journey?
Take the Data Maturity Quiz Now!
Remember, knowledge is power. By understanding where you stand today, you can make informed decisions for a brighter, data-driven tomorrow.