Like any input-dependent system, machine learning (ML) is subject to the "garbage in, garbage out" axiom. Clean, accurately labeled data is the basis for building any ML model. The learning algorithm extracts patterns from valid data and learns to generalize to unseen data. If the quality of your training data is poor, the algorithm will struggle to learn and will generalize poorly.
Think of it in terms of training a pet dog. If you fail to train your dog on basic behavioral commands (the input data), or train it incorrectly, you can never expect the dog to build on that foundation and grow into more complex positive behaviors, because the input was missing or deficient from the start. Proper training is time-consuming, and even costly if you bring in an expert, but doing it right from the start yields a big payoff.
When training an ML model, creating quality data requires that a domain expert take the time to annotate it. This may involve drawing a bounding box around the desired object in an image or assigning a label to a text or database record. For unstructured data such as images, video, and text, annotation quality plays a vital role in determining model quality. Unlabeled data, such as raw images and text, is typically abundant; labeling it is where the effort must be optimized. This is the human element of the ML lifecycle and is usually the most expensive and time-consuming part of any ML project.
Data annotation tools such as Prodigy, Amazon SageMaker Ground Truth, NVIDIA RAPIDS, and DataRobot's human-in-the-loop offerings are constantly improving in quality and providing more intuitive interfaces for experts. But minimizing the time domain experts spend annotating data remains a significant challenge for enterprises, especially in an environment where data science talent is scarce and in high demand. This is where two approaches to creating training data come into play.
Active learning
Active learning is a method in which the ML model asks an expert for specific annotations. The focus is not on obtaining a complete annotation of the unlabeled data but on annotating just the right data points so that the model can learn best. Take a healthcare and life sciences company that specializes in early cancer diagnosis, helping doctors make data-driven, informed decisions about patient treatment. As part of the diagnostic process, it needs CT scans annotated so that tumors can be isolated.
Once the ML model has been trained on a number of images with tumor regions highlighted, active learning means the model only asks users to annotate images when it is unsure about a tumor's presence. These are the borderline points whose annotation will increase the model's confidence the most. If the model's confidence exceeds a certain threshold, it labels the image itself rather than asking the user. Active learning thus attempts to help build accurate models while reducing the time and effort required to annotate the data; the routing logic can be sketched in a few lines, as shown below.
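As an illustration of that thresholding step, here is a minimal sketch in plain NumPy. The `route_for_annotation` helper and the 0.9 cutoff are assumptions for illustration, not part of any particular tool; in practice the threshold would be tuned per use case.

```python
import numpy as np

def route_for_annotation(probs, threshold=0.9):
    """Split pool predictions into auto-labeled and expert-review sets.

    probs: (n_samples, n_classes) predicted probabilities from the model.
    threshold: hypothetical confidence cutoff; tune per use case.
    """
    confidence = probs.max(axis=1)
    auto_idx = np.where(confidence >= threshold)[0]   # model labels these itself
    review_idx = np.where(confidence < threshold)[0]  # borderline cases go to the expert
    return auto_idx, review_idx
```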
Libraries such as modAL can help improve labeling efficiency by intelligently querying domain experts to label the most informative instances.
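A minimal modAL query loop might look like the following sketch. The synthetic dataset and the `y_pool` array that stands in for the expert's answers are assumptions made to keep the example self-contained; in a real project, each queried instance would be surfaced to a human annotator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

# Synthetic stand-in for a pool of unlabeled examples (e.g. scan features).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_seed, y_seed = X[:20], y[:20]    # small seed set labeled up front
X_pool, y_pool = X[20:], y[20:]    # y_pool simulates the domain expert's answers

learner = ActiveLearner(
    estimator=RandomForestClassifier(random_state=0),
    query_strategy=uncertainty_sampling,  # ask about the least-confident points
    X_training=X_seed,
    y_training=y_seed,
)

for _ in range(30):  # 30 expert queries instead of labeling the whole pool
    query_idx, _ = learner.query(X_pool)
    # A real workflow would show this instance to a domain expert;
    # here we answer with the held-back label instead.
    learner.teach(X_pool[query_idx], y_pool[query_idx])
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx)

print(f"accuracy after 30 queries: {learner.score(X, y):.3f}")
```

The key design point is the query strategy: uncertainty sampling selects exactly the borderline points described above, so each expert answer buys the largest possible gain in model confidence.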
Weak supervision
Weak supervision is an approach in which noisy, imprecise sources or higher-level heuristics are used to generate labels for large amounts of data without manual annotation. It typically combines multiple weak labelers in an ensemble to produce quality annotated data. The effort is to encode domain knowledge into automated labeling functions.
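Snorkel, a well-known open-source implementation of this idea (not mentioned above, but a natural fit), expresses each weak labeler as a labeling function and combines their noisy votes with a label model. A minimal sketch, with hypothetical keyword rules over a toy support-ticket dataset:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Each labeling function encodes one piece of noisy domain knowledge.
@labeling_function()
def lf_mentions_refund(x):
    return POSITIVE if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_mentions_thanks(x):
    return NEGATIVE if "thanks" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "I want a refund for this broken device",
    "Thanks, the issue is resolved",
    "Refund me now or I will dispute the charge",
    "All good, thanks for the quick reply",
]})

# Apply every labeling function to every record -> one matrix of noisy votes.
applier = PandasLFApplier([lf_mentions_refund, lf_mentions_thanks])
L_train = applier.apply(df_train)

# The label model weighs and denoises the votes into one label per row.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=200, seed=0)
print(label_model.predict(L_train))
```

The ensemble step is what makes the approach work: no single rule needs to be accurate or complete, because the label model learns how much to trust each one from their agreements and conflicts.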
At its core, ML helps companies scale processes to a degree that is physically impossible to achieve manually. However, ML is not magic, and it is still up to people to a) set up and train the models correctly from the start and b) intervene when necessary so that the model doesn't drift so far that its results are no longer helpful and become counterproductive or even harmful.
The goal is to find ways to simplify and automate some of the human involvement to improve time to market and results, without sacrificing accuracy. It is generally recognized that obtaining quality annotated data is the most expensive yet essential part of an ML project. This is an evolving area, and many efforts are underway to reduce the time spent by domain experts and improve the quality of annotated data. Studying and applying active learning and weak supervision is a solid strategy for achieving this goal across industries and use cases.