Working on Microsoft’s Cortana Is Laborious and Poorly Paid

Leaked documents show that Microsoft’s contractors are paid between $12 and $14 an hour and are asked to transcribe as many as 200 audio clips per hour to train the Cortana virtual assistant.
Image: David Paul Morris/Bloomberg via Getty Images

"Stop listening to me" is one example of a command a Cortana user may utter, according to a training manual for the human contractors Microsoft hires to listen to and classify users' speech.

Apple, Google, Amazon, and most recently Facebook have all been found to hire human workers to transcribe audio captured by their own products. Motherboard found that Microsoft does the same for some Skype calls, and is still doing so even though other companies have suspended their reliance on contractors.

A cache of leaked documents obtained by Motherboard gives insight into what the human contractors behind the development of tech giants' artificial intelligence services are actually doing: laborious, repetitive tasks that are designed to improve the automated interpretation of human speech. In other words, the virtual assistants and artificial intelligence that tech giants present as completing tasks on their own are trained by the monotonous work of people.

The work is magnified by the large footprint of speech recognition tools: Microsoft's Cortana product, similar to Apple's Siri, is built into Windows 10 machines and Xbox One consoles, and is also available on iOS, Android, and smart speakers.

"The bulk of the work I've done for Microsoft focused on annotating and transcribing Cortana commands," one Microsoft contractor said. Motherboard granted the source anonymity to speak more candidly about internal Microsoft processes, and because they had signed a non-disclosure agreement.

The instruction manuals on classifying this sort of data run to hundreds of pages, with a dizzying number of classification options and punctuation style guides for contractors to follow. The contractor said they are expected to work on around 200 pieces of data an hour, and noted they've heard personal and sensitive information in Cortana recordings. A document obtained by Motherboard corroborates that, for some work, contractors need to complete at least 200 tasks an hour.

The pay for this work varies. One contract obtained by Motherboard shows pay at $12 an hour, with the possibility of reaching $13 an hour through a bonus. A contract for a different task shows $14 an hour, with a bonus that could bring it to $15 an hour.

One section of the training materials focuses especially on how the trigger command "Hey, Cortana" is pronounced in different languages and accents, including German, Chinese, Japanese, and Australian, Canadian, and American variations of English.

Notably, one document tells contractors to transcribe a word as "Cortana" even if the user mispronounced Cortana as, say, "Cortona" or "Cortina," because, Microsoft believes, activating Cortana was the intent.
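
The transcription rule described above can be pictured with a small sketch. This is not taken from the leaked documents: the function name, the pattern, and the choice to only correct the word when it follows "hey" are all assumptions made for illustration. Only the two misspellings, "Cortona" and "Cortina," come from the document Motherboard reviewed.

```python
import re

# Hypothetical rule: treat "Cortona"/"Cortina" as the wake word only when
# it directly follows "hey", since that is when activating Cortana is the
# likely intent. The two variants are the examples cited in the article.
WAKE_WORD_VARIANTS = re.compile(
    r"\b(hey,?\s+)(?:cortona|cortina)\b",
    re.IGNORECASE,
)


def normalize_wake_word(transcript: str) -> str:
    """Rewrite a mispronounced wake word as "Cortana", keeping the "hey" prefix."""
    return WAKE_WORD_VARIANTS.sub(r"\g<1>Cortana", transcript)


if __name__ == "__main__":
    print(normalize_wake_word("hey cortina what's the weather"))
    print(normalize_wake_word("my ford cortina"))  # left unchanged
```

Restricting the rewrite to the "hey" prefix is a design choice of this sketch, so that an ordinary mention of "Cortina" is not altered; the documents, as described, rely on a human contractor to make that judgment.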

"There are tasks where we're required to clearly capitalize proper names that relate to a contact, or other personal info," the contractor said.

A Microsoft spokesperson told Motherboard in an emailed statement, "We’re always looking to improve transparency and help customers make more informed choices. Our disclosures have been clear that we use customer content from Cortana and Skype Translator to improve these products, we engage third party expertise to assist in this process, and we take steps to de-identify this content to protect people’s privacy."

After Motherboard reported that contractors were listening to some Skype calls made using the service's translator function, Microsoft updated its privacy policy and other pages to explicitly include that humans may listen to collected audio.

As for the work itself, one main task for contractors working with Cortana data is to classify it.

Contractors are asked to bucket each transcription into a "domain" or topic. The more than two dozen domains include "Calendar" for anything around appointments; "Alarm" for commands related to timers or alarms; and "Capture" for tasks that involve using the camera. Other domains include gaming, email, communication, feedback, events, home automation, note, media control, and "Orderfood," according to the documents. The "common" domain is for generic commands that could fit into more than one domain, the documents add.

Each domain then has several different "intents." For the Alarm domain, that includes set alarm, turn off alarm, find alarm, change alarm, snooze, set timer, find timer, and more.

Microsoft's human contractors analyze these Cortana commands, and then decide the appropriate domain and intent. Another document shows how intents are frequently added to or removed from different domains, changing the set of classifiers contractors have to work with.

Some audio also involves "double intent," where a user asks Cortana to complete two tasks at once; contractors have to look out for this as well, the documents add.
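
To make the labeling scheme concrete, here is a minimal sketch of how a domain, its intents, and a "double intent" annotation could be represented. It is an illustration only, not Microsoft's tooling: the class names, the snake_case spellings of the intents, and the sample transcript are assumptions. Only the Alarm domain is filled in, because it is the one domain whose intents the documents spell out here.

```python
from dataclasses import dataclass

# Intents the article lists for the Alarm domain. The snake_case spellings
# are assumed; the documents may use different identifiers.
ALARM_INTENTS = frozenset({
    "set_alarm", "turn_off_alarm", "find_alarm", "change_alarm",
    "snooze", "set_timer", "find_timer",
})

# Domain -> allowed intents. Other domains (calendar, capture, and so on)
# are omitted because their intents are not listed in the article.
TAXONOMY = {"alarm": ALARM_INTENTS}


@dataclass(frozen=True)
class Label:
    domain: str
    intent: str


@dataclass(frozen=True)
class Annotation:
    transcript: str
    labels: tuple  # one Label normally, two for a "double intent" command

    @property
    def is_double_intent(self) -> bool:
        return len(self.labels) == 2


def is_valid(label: Label) -> bool:
    """A label is valid only if its intent belongs to its domain."""
    return label.intent in TAXONOMY.get(label.domain, frozenset())


if __name__ == "__main__":
    # Hypothetical utterance, invented for this example.
    example = Annotation(
        transcript="set an alarm for seven and start a ten minute timer",
        labels=(Label("alarm", "set_alarm"), Label("alarm", "set_timer")),
    )
    print(example.is_double_intent, all(is_valid(l) for l in example.labels))
```

Because intents are added to and removed from domains over time, keeping the taxonomy as plain data rather than hard-coded checks means a change in the instructions would only require editing the table.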

One intent contractors are asked to classify data under is called "are_you_listening."