Data Preparation
2017-11-15

When it comes to machine learning, the quality and volume of training data are crucial, no matter how good your technology is. Poor data preparation can make even the best machine learning technologies look inept.

What is data preparation?

Data preparation is a necessary step in building artificial neural networks and other machine learning algorithms that mimic human intelligence in tasks such as detection, prediction, computer vision, and speech recognition. It is one of many components in the field of artificial and machine intelligence.

A key feature of neural networks and other machine learning algorithms is an iterative learning process in which data are presented to the network in a piecemeal, structured manner, one piece of data at a time. Once a network has been structured for an application, it is ready to be trained. Data preparation may involve activities such as data cleansing, data wrangling, data annotation, indexing, clustering, categorizing, classifying, and labeling. Sigma does this type of work regularly for audio, video, image, speech, text, and biometric data, and is source-agnostic, with capabilities in over 60 languages.
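As a rough illustration of the "one piece of data at a time" idea, the sketch below trains a single-neuron perceptron on a tiny labeled data set. It is a stand-in chosen for brevity, not the client's deep learning system or Sigma's tooling, and every name in it (`train_online`, `predict`, `AND_EXAMPLES`) is hypothetical.

```python
from typing import List, Tuple

# Each example is (features, label); the label is the human-provided annotation.
Example = Tuple[List[float], int]

# Hypothetical toy data set: the logical AND of two inputs.
AND_EXAMPLES: List[Example] = [
    ([0.0, 0.0], 0),
    ([0.0, 1.0], 0),
    ([1.0, 0.0], 0),
    ([1.0, 1.0], 1),
]


def predict(weights: List[float], bias: float, features: List[float]) -> int:
    """Return 1 if the weighted sum of the features is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


def train_online(
    examples: List[Example], epochs: int = 10, learning_rate: float = 1.0
) -> Tuple[List[float], float]:
    """Perceptron training: the model sees one labeled example at a time."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            # Compare the current prediction with the annotated label.
            error = label - predict(weights, bias, features)
            # Nudge the parameters only when the prediction was wrong.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


if __name__ == "__main__":
    weights, bias = train_online(AND_EXAMPLES)
    for features, label in AND_EXAMPLES:
        print(features, "label:", label, "predicted:", predict(weights, bias, features))
```

Because each update is driven by the label attached to a single example, a mislabeled example pushes the model in the wrong direction, which is why annotation quality matters so much.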

The Challenge

Our client, a multinational Fortune 100 technology company that designs, develops, and sells consumer electronics, computer software, and online services, needed to enhance large volumes of existing recorded data to feed a supervised learning process that trains the deep learning (DL) algorithms behind its intelligent agent system. The client required a near-perfect level of accuracy, which was beyond its existing capabilities, at a cost that made sense, in a way that was scalable, and quickly enough that it did not affect its ability to serve its customers.

The Solution

With minimal set-up time, and using a combination of our own software products and the client's technology, we quickly and cost-effectively increased their accuracy rate from 95% to 99.2%, which cut the error rate from 5% to 0.8%. This increase is paramount because the underlying technologies and algorithms depend heavily on near-perfect data annotation. Even the best technology will fail (in the form of lower-than-human accuracy) with poor data.

Sigma prepared the data for this use case using over 2,000 trained freelance linguists and annotation technicians. The ability to scale this freelance workforce up or down on a per-project basis is one reason we can keep costs low. Another is our proven quality assurance process, tested over the nine years Sigma has been in operation.

The human linguists and annotators in this case study used a list of conventions to transcribe, tag, index, and classify audio files. This involved not only typing the words heard in an audio file, but also identifying the need for punctuation, capitalization, contractions, acronyms, accents, and semantics, and deciphering fragments and filler words (such as “uh” and “umm”). It also involved the use of tags and markers to highlight areas of interest. Speaker gender, sentiment, topic, and function were categorized as well. This process was repeated for thousands of hours of audio files and tens of thousands of different speakers in a variety of languages and dialects.
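To make the annotation work described above more concrete, here is a minimal sketch of what one annotated audio segment might look like as a data record, along with a helper that marks filler words. The field names, the `<filler>` tag syntax, and the filler word list are illustrative assumptions, not the actual conventions used in this project.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical filler word list; real conventions would be project specific.
FILLER_WORDS = {"uh", "um", "umm"}


@dataclass
class AnnotatedSegment:
    """One transcribed and classified audio segment (hypothetical schema)."""
    audio_file: str
    transcript: str
    speaker_gender: str
    sentiment: str
    topic: str
    function: str
    tags: List[str] = field(default_factory=list)


def tag_fillers(transcript: str) -> str:
    """Wrap filler words in a <filler> marker, keeping attached punctuation."""
    tagged = []
    for token in transcript.split():
        bare = token.strip(".,!?").lower()
        if bare in FILLER_WORDS:
            tagged.append(f"<filler>{token}</filler>")
        else:
            tagged.append(token)
    return " ".join(tagged)


if __name__ == "__main__":
    segment = AnnotatedSegment(
        audio_file="example_0001.wav",  # hypothetical file name
        transcript=tag_fillers("Uh, call my, umm, office"),
        speaker_gender="female",
        sentiment="neutral",
        topic="phone call",
        function="command",
        tags=["background-noise"],
    )
    print(segment)
```

A structured record like this lets each convention (punctuation, filler handling, categories) be checked consistently during quality assurance, which matters when the same process is repeated across thousands of hours of audio.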

Why Sigma?

Compared to many other vendors in this arena, we’ve been told our prices are “very interesting” and that we have the highest accuracy rate. Because operations are run from our Madrid office (HQ is in San Francisco), overhead is low, which allows Sigma to stay competitive on pricing. We offer capabilities in the highest number of languages (62) and have perfected a unique quality assurance process that helps us achieve the highest levels of accuracy. The global Fortune 100 client in this case study has been a repeat customer for several years.