SEO (search engine optimization) and SEM (search engine marketing) purport to make sites search-engine-friendly so spiders can index your pages and find your content – fine. But often what masquerades as SEO/SEM is nothing more than unscrupulous hosers filling legitimate sites with spammy links (splinks) to fake, ad-filled websites they’ve created to harvest click-advertisement money. This spam isn’t the email spam you already get – V1AGR4, etc. – but blog spam (splogs) and link spam (splinks). Why? If you’ve heard of so-called “click fraud,” then you’re one step ahead. Basically, these charlatans leech Google PageRank value from others and trick users into clicking banners and ads that pay pennies per click.
We deal with these SEO link-spammers on Feed Me Links all the time. Del.icio.us has the same problem.
Tom Coates once defined “social software” as “things that get spam,” and he’s right. Feed Me Links has a semi-decent Google PageRank (5), a strong Technorati ranking (1,545, with 977 blogs linking to us), and no CAPTCHA required for signup or posting, which makes us an attractive target for spammers, who post spammy links to useless, ad-filled pages to increase those pages’ PageRank. They do this by leveraging our PageRank: the “pages” on Feed Me Links are aggressively interlinked, and if they can get their links to show up on the FML homepage, they get lots of “credibility,” because so many pages link to the Feed Me Links homepage.
How Has Feed Me Links Fought Splinking?
Historically we’ve relied on simple collaborative flagging and regular database cleaning (we expire inactive users after 90 days, for example) to keep the system limber. We’ve also made some changes to the homepage, but I’ve resisted removing the “recent links” box (which splinkers exploit) simply because it’s a fun and useful feature for real users. But splinkers have gotten more aggressive, overwhelming legitimate users and diluting the value of our service. The time to strike is now, before the problem gets out of hand. But how to attack back?
The Double-Edged Sword of Turing Tests
I’ve avoided personhood tests like CAPTCHAs on Feed Me Links because they inconvenience legitimate users, they are trivially defeated by a determined programmer, and they only prevent automated, bot-driven spam. My sneaking suspicion is that while botnets and automated spamming scripts are a real and chilling threat to online communities, paid humans punching in link after link are an equally scary threat to small online communities: they exhibit many of the same patterns as real users because they are real users (just not the users you want!). I’ve also avoided attempts to automatically detect splinkers, for two reasons:
The original Feed Me Links core users are an opinionated group of people, most of whom are friends whom I’m loath to accidentally insult by mis-categorizing as spammers (we run a harsh, zero-tolerance anti-spammer policy). Mistrusting your own users is a cardinal sin in community building.
Compared to emails or blog comments, splinks offer relatively few pieces of information by which to measure their “spamminess.” Emails have message headers, a message body, a sender, a recipient, and so on, but links are pretty simple – just a URL and some tags. So I wasn’t sure a single link would contain enough information for a computer to successfully identify it as spam or ham (ham is what spam detectors call legitimate content :-)
In the spirit of full disclosure, I will be “opening the kimono” to reveal my ongoing splink detection experiments. These are experiments in progress and I do not purport to be an expert in automated spam detection. No dolphins were harmed during the writing of this code. Use only as directed. Void where prohibited by your mom.
So how do we detect splinking?
In short: by looking for splinking behavior patterns at the USER level, not the link level, so we can begin to discover patterns over time. Since posting anything on Feed Me Links involves logging in, all actions can be tied back to the user account and from that, patterns can be detected. (we “discourage” creating multiple accounts from the same IP address).
Start by examining prior art
Let’s first review how email spam is detected, then use those principles to look for splinkers. Spam filters like SpamAssassin use a technique called naive Bayesian filtering to “learn” what spam looks like.* These programs run in three phases:
1. Write a set of heuristics to measure the spam-likeliness of a single message. Each heuristic test evaluates to a number. (These are SpamAssassin-style heuristics – things like: Subject starts with “Hello,” Subject contains “Your Bills,” To: address appears in Subject, etc.)
2. Train the software on a set of known-good and known-spam messages. Feed in a set of messages with their corresponding heuristic scores and, given a large enough number of messages, the program will “learn” which heuristics, at which values, contribute most to the probability that a message is spam.
3. Ask the software to predict the probability that a new message is spam, based on the heuristic scores of the new message and the knowledge gained in the training phase.

* Apple’s Mail.app uses a different technique called LSA, or Latent Semantic Analysis.

Apply Customized Heuristics

Phases 2 and 3 are easy because you can download a working naive Bayesian classifier and start playing with it without writing any math code (I used Ken Williams’ Algorithm::NaiveBayes from CPAN). But phase 1, the heuristics – that’s the interesting part, because the heuristics are specific to the type of content you’re filtering and classifying. As a website owner, you need to identify the hallmarks of your splinkers in a programmatic way. I identified ten criteria I could measure which seemed like good indicators of splinking. Without further ado:
These are the splink-detection heuristics from Feed Me Links:
- Domain_duplicity_index - repetitiveness of the domains you link to
- Dot_info_domain_index - percentage of domains you link to ending in .info
- Double_hyphenated_link_index - percentage of domains with two or more hyphens
- Freshness_index - age of the user on site as a percentage of the oldest user’s age
- Known_spammer_domain_index - percentage of domains linked to containing, among others, .blogspot.com*
- Tags_to_links_ratio - ratio of tags to links (spammers tend to overtag and rarely re-use tags)
- Tags_w_commas_index - percentage of your tag names that contain commas
- Tags_w_multiple_spaces_index - percentage of your tag names containing more than one space
- Userid_w_double_digits_index - does your username contain more than two digits
- Volume_index - how many links do you have?
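The actual Feed Me Links heuristic code (written in Perl) isn’t shown in this post, but here’s a minimal sketch in Python of how a few of these scores might be computed for one user. The function and field names are my own invention, not the site’s:

```python
from urllib.parse import urlparse

def heuristics(links, tags):
    """Compute a few illustrative splink-detection scores for one user.

    links: list of URL strings the user has posted
    tags:  list of tag-name strings the user has created
    """
    domains = [urlparse(u).hostname or "" for u in links]
    n = len(domains) or 1  # avoid division by zero for empty accounts

    return {
        # repetitiveness: 1.0 means every link points at the same domain
        "domain_duplicity_index": 1 - len(set(domains)) / n,
        # percentage of linked domains ending in .info
        "dot_info_domain_index": sum(d.endswith(".info") for d in domains) / n,
        # percentage of domains containing two or more hyphens
        "double_hyphenated_link_index": sum(d.count("-") >= 2 for d in domains) / n,
        # spammers overtag and rarely reuse tags, so a high ratio is suspicious
        "tags_to_links_ratio": len(tags) / n,
        # percentage of tag names that contain commas
        "tags_w_commas_index": sum("," in t for t in tags) / max(len(tags), 1),
    }

scores = heuristics(
    ["http://cheap--pills.info/a", "http://cheap--pills.info/b"],
    ["buy pills, cheap", "pills"],
)
```

Each score comes out as a plain number, which is exactly the shape a naive Bayesian classifier wants as input.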
Note that I don’t actually know which of these tests is the best indicator of splinking behavior. These may not be all the indicators, or even the best ones. They are, however, a place to start. The way naive Bayesian classifiers work means it’s easy to add new heuristics in the future and retrain the classifier. [SpamAssassin adds new tests all the time, for example.]
Over the last several months, I amassed a database of almost 2,000 known splinkers with historical link and tag data, collected by users voluntarily flagging splinkers and by my, um, manual detection methods. I then wrote code to calculate the above ten heuristics for any user, and to export the results as XML. Feeding in my list of known splinkers, with their heuristics, means I can train my Bayesian Classifier to recognize splinker behavior.
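The real training code uses the CPAN Algorithm::NaiveBayes module; purely to illustrate the train-then-predict cycle, here is a tiny self-contained Gaussian naive Bayes in Python (a simplified variant, not the same math as the CPAN module, and every name and number below is made up):

```python
import math
from collections import defaultdict

class TinyNaiveBayes:
    """Minimal Gaussian naive Bayes over numeric heuristic scores."""

    def __init__(self):
        self.stats = {}   # label -> {feature: (mean, variance)}
        self.priors = {}  # label -> prior probability

    def train(self, instances):
        """instances: list of (heuristic_dict, label) pairs."""
        by_label = defaultdict(list)
        for feats, label in instances:
            by_label[label].append(feats)
        total = len(instances)
        for label, rows in by_label.items():
            self.priors[label] = len(rows) / total
            self.stats[label] = {}
            for f in rows[0]:
                vals = [r[f] for r in rows]
                mean = sum(vals) / len(vals)
                var = sum((v - mean) ** 2 for v in vals) / len(vals) + 1e-6
                self.stats[label][f] = (mean, var)

    def predict(self, feats):
        """Return the most probable label for a new heuristic dict."""
        best, best_lp = None, -math.inf
        for label, prior in self.priors.items():
            lp = math.log(prior)
            for f, x in feats.items():
                mean, var = self.stats[label][f]
                lp += -((x - mean) ** 2) / (2 * var) - 0.5 * math.log(2 * math.pi * var)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = TinyNaiveBayes()
nb.train([
    ({"dot_info_domain_index": 0.9, "tags_to_links_ratio": 4.0}, "splinker"),
    ({"dot_info_domain_index": 0.8, "tags_to_links_ratio": 5.0}, "splinker"),
    ({"dot_info_domain_index": 0.0, "tags_to_links_ratio": 1.1}, "ham"),
    ({"dot_info_domain_index": 0.1, "tags_to_links_ratio": 0.9}, "ham"),
])
verdict = nb.predict({"dot_info_domain_index": 0.85, "tags_to_links_ratio": 4.5})
```

The point is the shape of the workflow: feed in labeled heuristic vectors, then ask for a label on a new one.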
The Mis-Education of a Splink Filter
Unfortunately, I didn’t have a similar database of known GOOD users, so at first the classifier only understood splinkers and spam. With the initial dataset of 1,970 splinkers, it had only ever seen mayhem and abuse! It had never known peace! Due to its rough childhood, like a dog that’s been kicked, it had issues telling friends from foes. Notice that in the example below, my code thinks I’m a splinker:
Chaffing and Winnowing
To build a quick-and-dirty list of good users, I used some coding judo to select everyone who’s listed as a friend (“peep”) by someone else, then manually checked each user on the list to make sure they weren’t splinkers. That exercise gave me a list of about 100 known-good power users to use as a sample training file. Kinda sucked, but you have to start somewhere.
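The “coding judo” itself isn’t shown in the post, but the idea – collect everyone who appears in someone else’s peep list – can be sketched like this (hypothetical data shapes; the real version is a database query):

```python
# peeps maps each username to the set of users they list as friends ("peeps").
# Hypothetical toy data, not real Feed Me Links accounts.
peeps = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "spammy123": set(),  # nobody peeps a splinker
}

# Anyone who appears in someone ELSE's peep list is a candidate good user.
candidates = {
    friend
    for user, friends in peeps.items()
    for friend in friends
    if friend != user
}
# candidates still needs a manual pass to weed out any splinkers.
```

Being someone’s peep isn’t proof of legitimacy, which is why the manual check afterward still matters.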
Action Items? Next Steps?
Right now the learner isn’t directly connected to the live database, so actual users aren’t being checked against it in real time. I just finished the coding, and I’m not especially confident in its ability to tell right from wrong yet. Here’s what needs to happen next:
Connect user-flagging to a recorded heuristic (currently, flagged users just get their account link sent to the admins for manual checking). That way, user flagging will directly influence the learner’s tendency to view them as splinkers.
Collect a bigger list of known good users to smarten up the classifier.
Run the classifier nightly against the live database for all users created within the last 7 days and “nonaggressively flag” any splinkers found. This function should send users an email saying “You’ve been flagged” without actually deleting their account (in case of errors in the test), or else forward the list of flagged accounts to admins for moderation. [any ideas here?]
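Sketched as code, that nightly job might look something like the following. Every name here is hypothetical – the real version would query the live database and use the trained classifier:

```python
from datetime import datetime, timedelta

def nightly_flag_run(users, classifier, mail, now=None):
    """Classify recently created users and non-aggressively flag splinkers.

    users:      iterable of dicts with 'name', 'created', 'heuristics' keys
    classifier: object with a predict(heuristics) -> label method
    mail:       callable(name, message) used to notify flagged users
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=7)
    flagged = []
    for user in users:
        if user["created"] < cutoff:
            continue  # only check accounts created in the last 7 days
        if classifier.predict(user["heuristics"]) == "splinker":
            # Flag without deleting, in case the classifier is wrong.
            mail(user["name"], "You've been flagged")
            flagged.append(user["name"])
    return flagged
```

The flagged list would then go to the admins for moderation rather than triggering any automatic deletion.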
If any legitimate users get flagged, they will (hopefully) respond to the email and we can try to adjust the heuristics. If you’d like to follow the technical bumpings and grindings of this ongoing process, you can subscribe to the Feed Me Links source code commit feed at Tools to Make Tools, which updates each time I check in new files. View the specific code and scripts covered in this article at: Tools to Make Tools [changeset 1308]