Overcome the challenges of working with small data

Have you ever had trouble with airplane seats because you're too tall? Or maybe you couldn't reach the top shelf at the supermarket because you're too short? Either way, almost all of these things are designed with the average height of a person in mind: 170 cm, or 5'7″.

In fact, almost everything in our world is designed around averages.

Most companies only use averages because they cover the majority of cases. They allow companies to reduce their production costs and maximize their profits. However, there are many scenarios where covering 70-80% of cases is not enough. As an industry, we need to understand how to deal effectively with the remaining cases.

In this article, we will discuss the challenges of working with small data in two specific cases: when datasets contain few records overall, and when dealing with poorly represented subsets of larger, biased datasets. You will also find practical advice on how to approach these problems.

What is small data?

It is important to first understand the concept of small data. Small data, as opposed to big data, is data that arrives in small volumes that are often understandable to humans. Small data can also sometimes be a subset of a larger data set that describes a particular group.

What are the issues with small data in real-life tasks?

There are two common scenarios for small data challenges.

Scenario 1: The data distribution describes the outside world pretty well, but you just don't have a lot of data. It may be expensive to collect, or it may describe objects that are not commonly seen in the real world. For example, breast cancer data for younger women: you'll probably have a reasonable amount of data for white women ages 45-55+, but not for younger ones.

Scenario 2: You may be building a translation system for a low-resource language. For example, a large amount of Italian-language data is available online, but for Rhaeto-Romance languages the availability of usable data is far more limited.

Problem 1: The model becomes prone to overfitting

When the dataset is large, you can avoid overfitting, but it's much more difficult with small datasets. You risk creating an overly complicated model that fits your data well, but isn't as effective in real-world scenarios.
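To illustrate, here is a minimal sketch (using scikit-learn and a synthetic dataset, both chosen purely for illustration) of how overfitting typically shows up on small data: near-perfect training accuracy paired with much weaker cross-validated accuracy.

```python
# Sketch: detecting overfitting on a tiny dataset via the train/CV accuracy gap.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A synthetic stand-in for a small dataset: only 60 labeled samples.
X, y = make_classification(n_samples=60, n_features=20, random_state=0)

model = DecisionTreeClassifier(random_state=0)   # unconstrained tree: very flexible
model.fit(X, y)
train_acc = model.score(X, y)                    # typically close to 1.0 on 60 samples

# Accuracy estimated on held-out folds is usually much lower — the overfitting gap.
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

print(f"train accuracy: {train_acc:.2f}, cross-validated accuracy: {cv_acc:.2f}")
```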

Solution: Use simpler models. When working with small data, engineers are often tempted to use complicated models to perform more elaborate transformations and describe more complex dependencies. But complex models won't solve your overfitting problem when the dataset is small and you don't have the luxury of simply feeding more data to the algorithm.
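As a hedged illustration (the synthetic dataset, the two models and their hyperparameters below are placeholders, not a recommendation for any specific task), a simple regularized linear model will often match or beat a far more flexible one when only a handful of samples are available:

```python
# Sketch: comparing a flexible model with a simple, regularized one on small data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Pretend we only have 80 labeled samples.
X, y = make_classification(n_samples=80, n_features=30, n_informative=5, random_state=0)

models = {
    "complex (random forest)": RandomForestClassifier(n_estimators=500, random_state=0),
    "simple (L2 logistic regression)": LogisticRegression(C=0.5, max_iter=5000),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```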

Besides overfitting, you might also notice that a model trained on small data doesn't converge very well. With such data, premature convergence can be a huge problem for developers, as the model falls into local optima very quickly and has a hard time recovering from them.

In this scenario, you can oversample your dataset. Many algorithms are available, ranging from classical sampling methods such as the Synthetic Minority Oversampling Technique (SMOTE) and its modern modifications to neural-network-based approaches such as Generative Adversarial Networks (GANs). The right choice depends on how much data you actually have. Often, stacking can also help you improve metrics without overfitting.
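For example, a minimal oversampling sketch with imbalanced-learn's SMOTE might look like this (the synthetic dataset and the 95/5 class split are only for illustration):

```python
# Sketch: oversampling the minority class with SMOTE from imbalanced-learn.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Simulate a biased dataset: roughly 95% majority class, 5% minority class.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class points by interpolating between
# existing minority samples and their nearest neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))
```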

Another possible solution is to use transfer learning. Transfer learning can be used to effectively develop...
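In practice, transfer learning usually means taking a model pretrained on a large dataset, freezing most of its weights, and training only a small task-specific head on your limited data. Below is a minimal sketch, assuming an image classification task and TensorFlow/Keras (the MobileNetV2 backbone, input size and layer choices are illustrative, not prescribed by the article):

```python
# Sketch: transfer learning with a frozen pretrained backbone and a small new head.
import tensorflow as tf

# Load a backbone pretrained on ImageNet and freeze it, so the small dataset
# only has to fit the new classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet"
)
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),                     # extra regularization for small data
    tf.keras.layers.Dense(2, activation="softmax"),   # e.g., a hypothetical binary task
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # your small dataset goes here
```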
