Kaggle, an online data science community that regularly hosts machine learning competitions with prizes often in the tens of thousands of dollars, has uncovered a cheating scandal involving a winning team. The Google subsidiary announced on Friday that the winner of a competition involving a pet adoption site had been disqualified from the contest for fraudulently obtaining and obscuring test set data.
The fact that a team cheated in a competition nominally intended to help shelter animals also raises questions about whether the people who participate in machine learning competitions like those hosted on Kaggle are actually interested in making the world a better place, or whether they simply want to win prize money and climb virtual leaderboards.
The competition asked contestants to develop algorithms to predict the rate of pet adoption based on pet listings from PetFinder.my, a Malaysian pet adoption site. The goal, according to the competition, was to help discover what makes a shelter pet's online profile appealing for adopters. The winning team's entry would be "adapted into AI tools that will guide shelters and rescuers around the world on improving their pet profiles' appeal, reducing animal suffering and euthanization," the competition site said.
The algorithm from BestPetting, the first-place team, appeared to predict the rate of adoption almost perfectly for the test set against which submissions were evaluated, winning with a score of 0.912 (out of 1.0). As a reward for their winning solution, the team of three was awarded the top prize of $10,000.
Nine months after the close of the competition, however, one observant teenager found that the impressive results were too good to be true. Benjamin Minixhofer, an Austrian machine learning enthusiast who placed sixth in the pet adoption competition, volunteered to help the company integrate the winning solutions into PetFinder.my’s website. In doing so, he discovered that the BestPetting team had obtained PetFinder.my’s testing data, likely by scraping it from Kaggle or PetFinder.my, then encoded that data and decoded it inside their algorithm to obfuscate their illicit advantage.
"Only some of the encoded answers were used, so as to keep their final score 'realistic,'" Andy Koh, the founder of PetFinder.my, wrote in a post explaining that the team had been disqualified. "It is very sad indeed that such brilliant people, including a highly respected Kaggle Grandmaster, have gone to such lengths to defraud a welfare competition aimed at saving precious animal lives, solely for their own financial gain."
Minixhofer is one of several volunteers who worked with PetFinder.my to implement the winning algorithms, but he told Motherboard that, "as far as I know, I am the only one who stuck through with helping them." He noted that because PetFinder.my wanted to use the winning results to improve pet profiles, and not simply to predict the speed of pet adoption, the work was more arduous and time-consuming than simply adding a machine learning service.
"I was also still in high school when the competition ended," Minixhofer said. "So I was only able to work with PetFinder.my on the side."
The cheating was also difficult to uncover because BestPetting buried most of its encoding and decoding in layers upon layers of function calls and return values, most of which had seemingly mundane, common names like “get_dict” (a dictionary is a data type in Python) or “process.” In addition, the team was careful to swap in the scraped data for only one in every ten pets, to avoid raising suspicions with an absolutely perfect result. By Minixhofer’s calculations, “their submission would have scored [about] 100th place with a score of 0.427526 without the cheat.”
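To make the mechanism concrete, here is a minimal Python sketch of how such a swap could work in principle. It is not BestPetting's code: the hashing step, the placeholder model, the made-up labels, and the plain accuracy metric are all assumptions chosen for illustration. Only the function names "get_dict" and "process" and the one-in-ten swap come from the account above.

```python
import hashlib


def _key(pet_id: str) -> str:
    # Assumption: IDs are hashed so the table looks like opaque config.
    return hashlib.md5(pet_id.encode("utf-8")).hexdigest()


# Hypothetical leaked test labels (made-up values 0..4), keyed by hashed ID.
LEAKED = {_key(f"pet{i}"): i % 5 for i in range(100)}


def get_dict():
    """Mundane name; actually returns the leaked label table."""
    return LEAKED


def model_predict(pet_id: str) -> int:
    """Stand-in for an honest model: always predicts class 2."""
    return 2


def process(pet_ids):
    """Predict for each pet, swapping in a leaked label for every tenth one."""
    table = get_dict()
    predictions = []
    for n, pet_id in enumerate(pet_ids):
        pred = model_predict(pet_id)
        if n % 10 == 0:  # only one in ten, so the score stays "realistic"
            pred = table.get(_key(pet_id), pred)
        predictions.append(pred)
    return predictions


def accuracy(pred, true):
    # Simple accuracy, used here only as a stand-in metric.
    return sum(p == t for p, t in zip(pred, true)) / len(true)


pet_ids = [f"pet{i}" for i in range(100)]
truth = [i % 5 for i in range(100)]
honest = [model_predict(p) for p in pet_ids]
cheated = process(pet_ids)

print("honest accuracy: ", accuracy(honest, truth))
print("cheated accuracy:", accuracy(cheated, truth))
print("predictions changed:", sum(h != c for h, c in zip(honest, cheated)))
```

In this toy setup the swap touches only a tenth of the predictions, yet every changed prediction is correct, which is why comparing a submission's output with and without its lookup tables is one plausible way for a reviewer to spot this kind of leak.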
Cheating is not uncommon in Kaggle competitions, where, for some, the glory of attaining ranks like “Expert” and “Grandmaster” is as important as the exorbitant cash prizes. But many in the data science community are especially shocked by the level of effort that went into the scam, and by the fact that several of the participants had high ranks on Kaggle. One of them, data scientist Pavel Pleskov, was a top-ranking Kaggle Grandmaster with many victories under his belt.
Pleskov has been banned permanently from Kaggle, as “evidence points towards him being the key party behind this fraudulent activity.” On Twitter, Pleskov apologized on behalf of his team, and noted that he intended to return the prize money to PetFinder.my. “For me, it was never about the money but rather about the Kaggle points: a constant struggle of becoming #1 in rating had compromised my judgment,” he wrote. “I hope at least some of you forgive me and hope other competitors will learn from my mistakes.”
Kaggle declined to comment for this article, but referred us to a post by Kaggle data scientist Walter Reade: "Cheating, in any form, erodes the awesomeness of the Kaggle community. Because of recent events, I'd like to re-express and reinforce Kaggle's stance on cheating."
In addition to losing his title as Grandmaster, Pleskov also lost his job at the open source software company H2O.ai, which specifically highlights its employment of Kaggle Grandmasters on its website.
"The behavior and actions that we became aware of [Saturday] regarding the Kaggle competition do not reflect the company's values," the company said. "This individual participated in this competition prior to his employment with us. We conducted an investigation and this person is no longer affiliated with H2O.ai."
Minixhofer said the event was an indication that Kaggle must do more to discourage nefarious behavior. In his eyes, this should be done by requiring all solutions to be public and open source. Though competition rules state that solutions must be open source, this means only that the algorithms must be developed under an open source license, not that they must be made public.
“That is a loophole in the rules that is even misunderstood by competition hosts,” he said. Implementing new rules to that effect “would prevent these incidents in the future.”