Data labeling and relabeling in machine learning

The never-ending process in data science


Supervised machine learning models are trained using data and their associated labels. For example, to discriminate between a cat and a dog present in an image, the model is fed images of cats or dogs and a corresponding label of “cat” or “dog” for each image. Assigning a category to each data sample is referred to as data labeling.

Data labeling is essential for giving machines knowledge of the world that is relevant to a particular machine learning use case. Without labels, models have no explicit understanding of the information in a given data set. A popular example that demonstrates the value of data labeling is the ImageNet data set. Its widely used benchmark subset contains more than a million images labeled across 1,000 object categories, and this pioneering data set heralded the deep-learning era.

In this article, you’ll learn more about data labeling in machine learning and its use cases, processes, and best practices.

Why is data labeling important in machine learning?

Labeled data is necessary to build discriminative machine learning models that classify a data sample into one or more categories. Once a machine learning model is trained using data and corresponding labels, it can predict the label of a new unseen data sample. Data labeling is a crucial process as it directly impacts the accuracy of the model. If a significant proportion of the training data set is mislabeled, it will cause the model to make inaccurate predictions.

Data labeling of production data is also important to counter data drift. The model can be continuously improved by incorporating the newly labeled samples from the real-world data distribution into the training data set.

Poorly labeled data can also introduce bias in the data set, which can cause the models to consistently make inaccurate predictions on a subset of real-world data. Mislabeling can severely impact the fairness and accuracy of models and warrants additional efforts to detect and eliminate labeling errors. Relabeling helps to address mislabeled samples, improving the data quality and, consequently, the accuracy of the machine learning models.

How is data labeling performed?

Data labeling produces the labels that supervised machine learning models learn from, and the same data can be labeled differently depending on the use case. For example, the following text, sourced from the Large Movie Review Dataset, can be annotated in several ways:

I saw this movie in NEW York city. I was waiting for a bus the next morning, so it was 2 or 3 in the morning. It was raining, and did not want to wait at the PORT AUTHORTY. So I went across the street and saw the worst film of my life. It was so bad, that I chose to stay and see the whole movie,I have yet to see anything else that bad since. The year was 69,so call me crazy. I stayed only because I could not belive it.........

1. Use case: Sentiment analysis

  • Label: [Negative]

2. Use case: Named entity recognition

  • Label (Place): [NEW York city], [PORT AUTHORTY]

3. Use case: Spelling correction

  • Label (Typo): [belive], [AUTHORTY]

For the named entity recognition use case, data annotators have to review the entire text and identify and label any mention of places.
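The three annotations above can be sketched as data structures. This is a minimal illustration, not the format of any particular labeling tool: the `Span` class, the `find_span` helper, and the label names `PLACE` and `TYPO` are assumptions made for the example, and the review text is shortened to three of its sentences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # character offset where the labeled text begins
    end: int    # character offset just past the labeled text
    label: str  # hypothetical label name, e.g. "PLACE" or "TYPO"


def find_span(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of `phrase` and wrap it in a Span."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return Span(start, start + len(phrase), label)


# Shortened excerpt of the review, with the original spelling preserved.
review = (
    "I saw this movie in NEW York city. "
    "It was raining, and did not want to wait at the PORT AUTHORTY. "
    "I stayed only because I could not belive it"
)

annotations = {
    # Sentiment analysis: one label for the whole document.
    "sentiment": "negative",
    # Named entity recognition: one span per mention of a place.
    "places": [
        find_span(review, "NEW York city", "PLACE"),
        find_span(review, "PORT AUTHORTY", "PLACE"),
    ],
    # Spelling correction: one span per misspelled word.
    "typos": [
        find_span(review, "belive", "TYPO"),
        find_span(review, "AUTHORTY", "TYPO"),
    ],
}

for span in annotations["places"] + annotations["typos"]:
    print(span.label, repr(review[span.start:span.end]))
```

Storing character offsets rather than the matched strings keeps each label tied to one specific position in the text, which matters when the same word appears more than once.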

Typically, data annotation is outsourced to vendors who contract subject matter experts relevant to the specific machine learning use case. The team of annotators is assigned different batches of data to label on a daily basis for the duration of the project, using simple tools like Excel or more sophisticated labeling platforms like Label Studio. Labelers’ performance is evaluated in terms of metrics like overall accuracy and throughput—i.e., the number of samples labeled in a day.

If the same set of data samples is assigned to multiple annotators, the labels given by each annotator can be combined through a majority vote. Measuring inter-annotator agreement on these shared samples helps to detect and reduce bias and mislabeling errors.
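A minimal sketch of both ideas follows, using only the Python standard library. The sample IDs and votes are made up for illustration, and Cohen's kappa is used here as one common agreement measure for two annotators; the article does not prescribe a specific metric. Returning `None` on a tie is likewise an assumption about how ties might be handled.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None when the top two labels tie."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: escalate to a reviewer instead of guessing
    return counts[0][0]


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators who labeled the same samples."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Fraction of samples on which the two annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical votes from several annotators per image.
votes = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog"],
}
consensus = {sample: majority_vote(v) for sample, v in votes.items()}
print(consensus)

# Hypothetical labels from two annotators on the same four samples.
annotator_a = ["cat", "cat", "dog", "dog"]
annotator_b = ["cat", "dog", "dog", "dog"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 indicates strong agreement beyond chance, while a value near 0 suggests the annotators agree no more often than chance would predict, which is usually a sign that the guidelines need clarification.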

For several use cases, data labeling can be extremely painstaking and time-consuming, which may lead to labeling fatigue. To counter this, the labels assigned by each annotator undergo one or more rounds of review to catch any systematic errors. Once a batch of data is labeled, reviewed, and validated, it is shared with the data science team, who check select samples for labeling accuracy and then provide feedback to the annotators. This iterative and collaborative process ensures that the final labels are accurate enough to use for training machine learning models.

How is data relabeling performed?

Because data labeling is repetitive and manual, it is often fraught with errors. This makes it necessary to identify and relabel samples that were erroneously labeled the first time around. Relabeling is expensive but necessary, as a high-quality training data set is imperative. Unlike labeling, relabeling is usually done on a smaller sample of the entire data set and can be completed much faster if the samples are mislabeled in the same systematic way or are associated with the same annotator.

Once a trained model is deployed, its predictions on real-world data can be evaluated. A detailed error-analysis process can sometimes reveal systematic prediction errors. Many times, these characteristic errors may be correlated with a certain type of data sample or feature. In such cases, having another look at similar samples in the training data can help identify mislabeled samples. More often than not, labeling errors on a certain segment of the training data can be captured through such error analysis and corrected with relabeling.
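One simple form of this error analysis is to group evaluated predictions by a feature and compare error rates across the groups. The sketch below assumes each record is a dictionary with `label` and `predicted` keys plus a hypothetical `source` feature; the field names, the sample records, and the 0.5 threshold are all illustrative choices rather than recommendations.

```python
from collections import defaultdict


def error_rate_by_segment(records, segment_key):
    """Group evaluated predictions by a feature value and compute error rates."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        segment = record[segment_key]
        totals[segment] += 1
        if record["predicted"] != record["label"]:
            errors[segment] += 1
    return {segment: errors[segment] / totals[segment] for segment in totals}


def segments_to_review(rates, threshold):
    """Return segments whose error rate meets or exceeds the threshold."""
    return sorted(seg for seg, rate in rates.items() if rate >= threshold)


# Hypothetical evaluation records from a deployed cat/dog classifier.
records = [
    {"source": "mobile", "label": "cat", "predicted": "cat"},
    {"source": "mobile", "label": "dog", "predicted": "dog"},
    {"source": "scanner", "label": "cat", "predicted": "dog"},
    {"source": "scanner", "label": "dog", "predicted": "dog"},
    {"source": "scanner", "label": "cat", "predicted": "dog"},
]

rates = error_rate_by_segment(records, "source")
flagged = segments_to_review(rates, threshold=0.5)
print(rates)
print(flagged)
```

A flagged segment does not prove that its training labels are wrong, since the model may simply lack data for it. It does indicate where a second look at similar training samples is most likely to pay off.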

Best practices for data labeling in machine learning

Data labeling can be prohibitively expensive and time-consuming for large data sets. As model development is contingent on the availability of good-quality labeled data, poor labeling can delay the building and deployment of machine learning models.

A good practice for data scientists is to curate a comprehensive data-annotation framework for each use case before starting the data-labeling process. Clear, structured guidelines with examples and edge cases provide much-needed clarity for annotators to do their job with greater speed and accuracy. In the absence of domain experts within the company, external experts can be sought to discuss and conceptualize guidelines and best practices for labeling specific types of data.

As labeling of large data sets by domain experts can be quite expensive, in specific cases, data labeling can be crowdsourced to thousands of users on platforms like Amazon Mechanical Turk. Typically, labeling by crowdsourced users is fast but often noisy and less accurate. Still, crowdsourcing can be a significantly quicker method of collecting the first set of labels before doing one or more rounds of relabeling to eliminate errors.

Error analysis is another recommended practice to diagnose model prediction errors and iteratively improve model performance. Error analysis can be done manually by the data scientists or with greater speed and reproducibility using machine learning debugging platforms like Openlayer.

Another good practice, in the context of very large data sets for deep learning applications, is to leverage machine learning to obtain a first pass of labels.
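As one possible illustration, a first pass of labels can come from an existing model, with only the low-confidence predictions routed to human annotators. The sketch below is an assumption about how such a workflow might look, not a technique taken from this article: the `route_prelabels` function, the prediction format, and the 0.9 threshold are all hypothetical.

```python
def route_prelabels(predictions, min_confidence=0.9):
    """Split model predictions into accepted pre-labels and a human review queue.

    `predictions` maps a sample ID to a dict of class probabilities.
    """
    accepted = {}
    needs_review = []
    for sample_id, probabilities in predictions.items():
        # Pick the class with the highest predicted probability.
        label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
        if confidence >= min_confidence:
            accepted[sample_id] = label
        else:
            needs_review.append(sample_id)
    return accepted, needs_review


# Hypothetical class probabilities from a previously trained model.
predictions = {
    "img_001": {"cat": 0.97, "dog": 0.03},
    "img_002": {"cat": 0.55, "dog": 0.45},
    "img_003": {"cat": 0.08, "dog": 0.92},
}

accepted, needs_review = route_prelabels(predictions)
print(accepted)
print(needs_review)
```

High model confidence does not guarantee a correct label, so accepted pre-labels would still benefit from the spot checks and review rounds described earlier.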

Conclusion

Machine learning and deep-learning models are typically trained on large data sets. To train such models, a label for each data sample is necessary to teach the model about the information in the data set. Labeling, therefore, is an integral aspect of the machine learning lifecycle and directly influences the quality and performance of models in production.

In this article, you’ve seen the importance, process, and best practices for efficient data labeling and relabeling. Mislabeled data samples introduce noise and bias in the data set that adversely impact the performance of the model. Identifying mislabeled examples through error analysis is a proven technique to improve the quality of training data, and it can be accelerated using machine learning debugging and testing platforms like Openlayer.

* A previous version of this article listed the company name as Unbox, which has since been rebranded to Openlayer.

Sundeep Teki

Dr. Sundeep Teki is a leader in AI and neuroscience with professional experience in the US, UK, India, and France. He has published 40+ papers; built and deployed AI for consumer tech products like Amazon Alexa; advises and consults tech startups on AI/ML, product, and strategy; and coaches data and AI professionals and executives.
