Steal this article, bots. I dare you

Graphic by Sean Mullins.

I recently discovered the shocking truth that bots are stealing articles from the Webster Journal. Naturally, I decided on the only logical response: Write an article about it that will inevitably be stolen itself.

On Monday, Oct. 10, I was scrolling through article comments on the Journal website’s dashboard – because that’s what you do when you have no life – and discovered several pingbacks. When other websites embed links to our articles into text like this, WordPress notifies us. However, these articles weren’t being shared. They were being copied and pasted by scam websites – or, to use a colloquialism, “borrowed indefinitely.”

This phenomenon is the result of scrapers, bots that invade websites and search engines for digital content to repost. We only noticed these stolen articles because, in their infinite wisdom, the bots accidentally copied embedded links to the originals. Veteran and new writers from every section fell victim to scrapers; apparently, the bots really liked my critique of dating apps, because it got pingbacks from four unique imposters.

Seeing these bootleg articles was like real-world “can I copy your homework” memes. They differentiated themselves with errors (“Dine and Discuss” became “Dine and Discus”) or awkwardly replaced synonyms (“Students take hands-on learning to new level” became “College students take studying to a brand new stage”). Most notably, my review of “Shovel Knight Dig” was poorly Google Translated into German, resulting in this S-tier Tweet with one like.

Photo by Sean Mullins. The Journal’s WordPress website received pingbacks from news scraper bots plagiarizing students’ articles, complete with slightly changed article titles.

We got lucky that the scrapers were stupid enough to embed links, but who knows how long they targeted us without our knowledge? How many of our articles have been purloined by malware-infested sites? For all I know, there’s a bootleg version of my “Deltarune: Chapter 2” review that downloads Spamton G. Spamton onto your hard drive when clicked. I had to investigate further.

According to a MediaShift report by Rami Essaid, 2015 was the first year in which more web traffic came from bots than humans, with roughly 59% of website visits being automated. Essaid distinguished 36% of web traffic that year as “good bots,” including search engines and social media aggregators that benefit websites, whereas “bad bots” like news scrapers made up 23% while pillaging across the internet.

“[Bad bots] can drain visitors from a site, damage their SEO rankings and cost advertising revenue,” Essaid said. “And because of all the bandwidth bad bots use while stealing content, they can make pages slower to load, annoying human visitors and further harming search-engine rankings.”

Photo by Morgan Smith. A sign inside Sverdrup Hall Room 116, where the Journal staff writes, edits, pitches and publishes articles.

From pilfering clicks to harvesting data, it only takes minutes for scrapers to steal weeks’ worth of honest work. This isn’t limited to text content like Journal articles; image and video content can also be plagiarized. Essaid noted that bot prevention software can fix problems on an individual level, but preventing this systemic issue with digital publishing is implausible when laws like the Digital Millennium Copyright Act lack effective enforcement.

Even major publications aren’t immune to scrapers. After the shady website Newsbuzzr posted bootleg versions of her articles, HuffPost senior reporter Jesselyn Cook shared her trip down the rabbit hole of bot plagiarism. Reading her article felt like a beat-by-beat description of everything I saw on our pingbacks: randomly placed synonyms that break otherwise coherent sentences, clearly unsafe scam websites, occasional links to the original articles, etc.

Cook’s search showed her several websites that stole content from every big-name publisher imaginable – from The New York Times to Wired – and monetized it with revenue programs like Google AdSense. Although Google later informed Cook that Newsbuzzr’s quality guideline violations got the site blocked from AdSense, these sites abused a broken system that requires manual reaction instead of proactively fighting scrapers. Therein lies the motive for article theft: profit.

“At a surface level, this rip-reword-repost operation is a creative little scam (and yes, it has produced some truly excellent ‘Florida Male’ content),” Cook said. “But it is alarming that it’s evidently profitable to scrape content for ad traffic in this way, and the scheme illustrates just how skewed the economic incentives of our click-driven media industry are.”

To learn more about how scrapers work and what solutions exist, I reached out to webmaster, writer and plagiarism/copyright consultant Jonathan Bailey. Scraping operations popped up during the internet’s rapid growth in the early 2000s, which is when Bailey saw his journalistic writing frequently plagiarized. This galvanized him to share “techniques for detecting and stopping online content misuse” by launching his website, Plagiarism Today, in 2005.

Google dealt a serious blow to scrapers starting in February 2011 with several algorithm updates, including Google Panda, which promotes high-quality websites with original, thoroughly researched content. Although scrapers temporarily became less common after these changes, they never disappeared completely. Bailey explained that scrapers have made a recent resurgence because they’ve become cheaper and more convenient. They’re designed to be as hands-free and effortless as possible, making them appear lucrative.

“Even if it doesn’t work 99.9% of the time, all one has to do is create 1,000 websites hosting spun content to find some degree of success. Spam, all kinds of spam, is always a numbers game and the numbers have simply started to favor those sites more,” Bailey said.
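Bailey’s numbers game can be checked with quick arithmetic. The sketch below is an illustration, not his calculation: it assumes each site fails independently at the 99.9% rate from the quote, and the function names are made up for this example.

```python
def chance_of_any_success(failure_rate: float, sites: int) -> float:
    """Probability that at least one of `sites` attempts succeeds.

    Assumes every site fails independently with the same failure_rate,
    which is a simplification for illustration.
    """
    return 1.0 - failure_rate ** sites


def expected_successes(failure_rate: float, sites: int) -> float:
    """Average number of sites that succeed under the same assumption."""
    return (1.0 - failure_rate) * sites


# Figures taken from the quote: 99.9% failure rate, 1,000 websites.
print(f"Chance of at least one success: {chance_of_any_success(0.999, 1000):.1%}")
print(f"Expected successful sites: {expected_successes(0.999, 1000):.2f}")
```

Under those assumptions, the chance that at least one of the 1,000 sites pays off comes to roughly 63%, with about one success expected on average.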

Because of how customizable they are, scrapers employ a wide range of tactics, from grabbing keywords to tracking RSS feeds. Some scrapers copy individual pages and shove them into synonym generators, while others combine sentences from different pages into what Bailey described as “Frankenstein articles that read like incoherent goop.” Monetization methods also vary from AdSense revenue to signal-boosting other suspicious websites with links, or even ads from other scammers.
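To see why synonym swapping leaves fingerprints, here is a minimal sketch that compares a headline with its spun copy by counting shared words (Jaccard similarity). It is one simple way to measure overlap, not a description of how any real scraper or plagiarism checker works; the function names are hypothetical, and the first two headlines come from the pingbacks described earlier.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercase the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_similarity(first: str, second: str) -> float:
    """Shared distinct words divided by all distinct words in both texts."""
    a, b = word_set(first), word_set(second)
    if not a and not b:
        return 0.0  # nothing to compare
    return len(a & b) / len(a | b)


original = "Students take hands-on learning to new level"
spun = "College students take studying to a brand new stage"
# Made-up headline, included only as an unrelated point of comparison.
unrelated = "Local bakery wins award for sourdough"

print(f"original vs. spun:      {jaccard_similarity(original, spun):.2f}")
print(f"original vs. unrelated: {jaccard_similarity(original, unrelated):.2f}")
```

A score near 1 means nearly identical wording, while unrelated headlines score near 0. The spun headline lands in between, sharing four of the 13 distinct words across both versions.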

It felt slimy to hear that scammers could profit off of student journalists’ work. They’re not even good enough to be my fakes. However, Bailey reassured me that they’re probably scraping the bottom of the barrel. The few companies that advertise on scam websites pay very little, and any meager revenue probably goes to a higher authority.

“I’m confident that [scrapers] are making an ill-gotten living, but they are generally middle people either for clients wanting unscrupulous SEO or advertisers peddling questionable products,” Bailey said. “Those groups are likely doing much better.”

So, what can digital publishers do to fight back? Anyone can file copyright notices with Google or web hosts, and Google routinely updates its algorithms to demote low-quality websites that irritate search-engine users. That said, Google itself is fallible on rare occasions, sometimes incorrectly flagging scrapers as the original source and punishing their victims.

One reason why creators might not pursue scrapers is that fighting them requires investing either time or money, which are much more limited for smaller publications. Bailey advises putting those resources toward a targeted approach; while finding every scraper is unreasonable and ineffective, searching for copied articles that rank better in Google will provide the most valuable targets with less effort.

Photo by Morgan Smith. A laptop belonging to Sean Mullins, the Journal’s Managing Editor, on which this article was written.

Having your hard work plagiarized can be demoralizing, although thieves probably can’t afford a Kraft single off of stolen website traffic. I imagine this won’t be the last time scrapers take from the next generation of Journal staffers, and while I strongly doubt I’ll pursue journalism myself after graduation, copyright theft is prevalent in all media fields. Fortunately, Bailey left me with words of encouragement from one writer to another.

“As tacky as it sounds, the best advice I can give is to not let it get to you. Understand that it does exist and there may be times you have to deal with it, but don’t blow it out of proportion. Remember, this isn’t personal, this is being done by bots, and it’s a problem with the internet itself, not you or your work,” Bailey said. “Just keep a pragmatic view, and you should be fine.”


Sean Mullins (she/they) is the managing editor and webmaster for the Journal, formerly the opinions editor during the 2021/2022 school year. She is a media studies major and professional writing minor at Webster University, but she's participated in student journalism since high school, having previously been a games columnist, blogger and cartoonist for the Webster Groves Echo at Webster Groves High School. Her passion is writing and editing stories about video games and other entertainment mediums. Outside of writing, Sean is also the treasurer for Webster Literature Club. She enjoys playing games, spending time with friends, LGBTQ+ and disability advocacy, streaming, making terrible puns and listening to music.
