The most important yet unanswerable question in AI
The uncomfortable truth about how much data you need for using AI
Photo by Joshua Sortino on Unsplash
Data plays a central role in the development of AI and ML models. These tools are trained on data; without data, there is no AI model. You can view model training as a way to extract value from data, which suggests that, in many situations, more data brings more value, that is, better performance from our AI and ML algorithms.
However, collecting data, especially individual data points with specific label information, can be a challenge in many non-digital sectors like manufacturing and life sciences.
While e-commerce and digital marketing can easily collect millions of data points from users and people browsing the internet, things are very different when there is a real-life object or sample behind each data point. For instance, in molecular biology, omics technologies can measure many variables, but the sample size is often limited by the effort required to set up the experiments. In medicine, many studies involve only a handful of patients because of the challenges of patient recruitment.
Manufacturing data comes with its own set of challenges: a new product might have no data at all, while existing products might suffer from rare anomalies, missing labels, or missing measurements of crucial variables (e.g. which consumables are used in each shift, whether operators change, manual interventions, etc.).
Given these challenges, before embarking on an AI project it is sensible to estimate how much data needs to be collected before an AI model can perform reasonably well. In domains where data is scarce and costly to gather, estimating the necessary sample size becomes crucial for budgeting.
However, the question “How much data do we need for our AI project?” cannot be answered in advance. A rule of thumb suggests that the number of collected samples should be 10 times the number of variables. But this is merely a guideline, and, as explained above, the set of variables itself could be wrong: it may be missing very important variables and include irrelevant ones. Since data science is about answering questions based on data, we face a chicken-and-egg dilemma: we need data to determine whether we have enough data for our goals.
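To make the weakness of the rule of thumb concrete, here is a minimal sketch of the arithmetic. The function name and the variable lists are hypothetical, chosen only for illustration; the factor of 10 is the guideline mentioned above, not a guarantee.

```python
def rule_of_thumb_samples(n_variables: int, samples_per_variable: int = 10) -> int:
    """Sample count suggested by the '10 times the number of variables' guideline."""
    if n_variables < 0:
        raise ValueError("n_variables must be non-negative")
    return n_variables * samples_per_variable


# Hypothetical variables measured today in a manufacturing project.
measured = ["temperature", "pressure", "shift", "operator"]

# Hypothetical variables we later realize were missing.
revised = measured + ["consumable_batch", "manual_intervention"]

estimate_now = rule_of_thumb_samples(len(measured))
estimate_revised = rule_of_thumb_samples(len(revised))

print(estimate_now, estimate_revised)
```

The estimate moves as soon as the list of variables changes, which is exactly why the guideline cannot settle the question before any data has been collected.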
If we do not know how much data we need for our AI project, how can we proceed? We recommend a data-centric approach, where improvements in AI performance are obtained primarily by changing the data rather than the model. In our case, this means collecting the data iteratively:
- Based on previous data collection, and to the best of our knowledge, determine all informative variables we can measure in the upcoming data collection
- Collect new data; the number of samples depends on budget and goals
- Process both old and new data to create a dataset that best represents real-world scenarios
- On this newly-derived dataset, retrain an existing AI model or train a new AI model
- Deploy the trained AI model in production
- Regularly measure the performance of the AI model in production using various metrics; if the measurements are unsatisfactory or start degrading, go back to the first step.
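The last step of the cycle can be sketched as a simple check on the metrics logged in production. This is only an illustration under stated assumptions: the function name, the window size, and the thresholds are hypothetical, and the metric is assumed to be one where higher is better (e.g. accuracy).

```python
from statistics import mean


def should_restart_cycle(history, min_score, window=3, max_drop=0.05):
    """Return True when production metrics call for a new data collection round.

    history: chronological list of metric values (higher is better).
    min_score: hypothetical minimum acceptable average score.
    window: number of most recent measurements to average.
    max_drop: hypothetical tolerated drop between consecutive windows.
    """
    if len(history) < window:
        return False  # not enough measurements to judge yet

    recent = mean(history[-window:])
    if recent < min_score:
        return True  # performance is unsatisfactory

    if len(history) >= 2 * window:
        previous = mean(history[-2 * window:-window])
        if previous - recent > max_drop:
            return True  # performance is degrading

    return False


stable = [0.90, 0.91, 0.90, 0.90, 0.89, 0.90]
degrading = [0.95, 0.95, 0.95, 0.85, 0.85, 0.85]

print(should_restart_cycle(stable, min_score=0.80))
print(should_restart_cycle(degrading, min_score=0.80))
```

In the second series the score is still above the minimum, but the drop between the two windows already signals that it is time to go back to the first step.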
This iterative approach offers several benefits. By collecting data step by step (the second step of the iteration), you can adjust and guide the collection process so that the distribution of your dataset closely resembles the real-world distribution: you can collect data about cases that are missing, or realize that you should collect additional variables. For example, in medical settings you may realize that the demographics of your patients underrepresent some categories, or in an anomaly detection project for manufacturing you may discover that you are gathering data at times when anomalies are least likely.
There is also an additional reason in favor of this iterative approach: counteracting so-called data drift. In many scenarios, unfortunately, older data becomes less and less representative of the present situation, so it is necessary to periodically retrain the AI model on relatively recent data only. By collecting data incrementally rather than in one go, we can better counteract the effects of data drift, because we can still make use of some of the data already collected. If all data were collected at once, data drift would hit sooner and harder: the whole training set, and not just the oldest slices as in the iterative approach, would become obsolete.
In conclusion, AI needs data, but it is challenging to estimate how much data is needed. To save on budget and maintain good performance, it is advisable to collect new data regularly and to monitor the performance of the AI model.
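The difference between the two collection strategies can be illustrated with a small sketch that keeps only recent batches for retraining. The batch structure, the dates, and the one-year cutoff are hypothetical; how quickly data becomes obsolete depends entirely on the application.

```python
from datetime import date, timedelta


def select_recent_batches(batches, today, max_age_days):
    """Keep only the batches collected within the last max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [b for b in batches if b["collected_on"] >= cutoff]


today = date(2024, 1, 1)

# Hypothetical iterative collection: three batches spread over time.
iterative = [
    {"name": "batch_1", "collected_on": date(2022, 6, 1)},
    {"name": "batch_2", "collected_on": date(2023, 3, 1)},
    {"name": "batch_3", "collected_on": date(2023, 9, 1)},
]

# Hypothetical one-shot collection: everything gathered at the start.
one_shot = [{"name": "all_data", "collected_on": date(2022, 6, 1)}]

usable_iterative = select_recent_batches(iterative, today, max_age_days=365)
usable_one_shot = select_recent_batches(one_shot, today, max_age_days=365)

print([b["name"] for b in usable_iterative])
print([b["name"] for b in usable_one_shot])
```

With the iterative schedule only the oldest batch is discarded, while the one-shot dataset becomes obsolete all at once.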
Take the Data Maturity Quiz for free
In the world of data science, understanding where you stand is the first step toward improvement. Are you curious to know how data-savvy your company really is? Do you want to identify areas for improvement and assess your organization's level of Data Maturity? If so, I have the right tool for you.
Introducing the Data Maturity Quiz:
- Quick and easy: with only 14 questions, you can complete the quiz in less than 9 minutes.
- Comprehensive assessment: get a holistic view of your company's Data Maturity. Understand its strengths and the areas that need attention.
- Detailed insight: receive a free score for each of the four essential elements of Data Maturity. This will give you a clear picture of where your organization excels and where there is room for improvement.
Becoming a truly data-driven organization requires a moment of introspection. It means understanding your current capabilities, recognizing the areas for improvement, and charting the path forward. This quiz was designed to give you these insights.
Are you ready to embark on this journey?
Take the Data Maturity Quiz now!
Remember, knowledge is power. By understanding where you stand today, you can make informed decisions for a better, data-driven future.