
Bot or Not?

Note: This entry has been restored from old archives.

Stick a form on the web and within a few days it’s getting hammered. My recently added “comment” form started getting a few posts, all to the “Comments” entry (web-search anyone?). So I added email registration… I could have saved myself time by just not adding the silly “comment” ability in the first place. The registration was just in time: a day later the flood started. Maybe not much of a flood by big-web standards (it must be scary to be a popular website!), but in the last 12 hours I’ve had just over 1000 POSTs of the comment form.

They’re not hitting the “Comments” entry much anymore either. The breakdown is:

      1 Food/Ristretto/The_Coffee_House_on_Watford_High_Street.html
      9 Technology/General/Comments.html
     46 Food/Cooking/Spinach_Pasta.html
     47 Random/Riverbank_Teahouse.html
     48 Food/Eating/England/London/Gourmet_Burger_Kitchen.html
     49 Technology/Code/awk_awk_awk_.html
     52 Technology/Code/Just_Like_Uni.html
     53 Technology/General/Flashy_Shite.html
     54 Food/Cooking/Lime_Poached_Chicken.html
     55 Food/Eating/England/London/The_Neal_Street_Restaurant.html
     59 Technology/General/EZRSSFeeds_and_other_WebSuckers.html
     59 Technology/General/Still_Doesn_t_Like_Kaspersky.html
     67 Random/Flip_Out_Like_A_Ninja.html
     69 Random/BAA_BAA_Whisky.html
     70 Random/Birdflu.html
     83 Health/Your_Back_Needs_Debugging.html
     84 Food/Ristretto/Caffe_Vergnano_1882_on_Charing_Cross_Road.html
     87 Health/Beerolies.html
     98 Technology/General/Collateral_Damage__An_Unintentional_Storm_Worm_DOS.html
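
A breakdown like this can be pulled out of an access log with very little code. What follows is only a rough sketch, assuming a common/combined-format log where the request appears as a quoted `"METHOD /path HTTP/x.y"` string; the sample lines, addresses and function names are made up for illustration and are not my real logs.

```python
import re
from collections import Counter

# Matches the quoted request part of a common/combined log format line,
# e.g. "POST /Random/Birdflu.html HTTP/1.1" (the log layout is an assumption).
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')


def count_posts_by_path(lines):
    """Return a Counter mapping request path to the number of POSTs seen."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group("method") == "POST":
            counts[match.group("path").lstrip("/")] += 1
    return counts


def format_breakdown(counts):
    """Render counts in ascending order, much like `sort | uniq -c | sort -n`."""
    rows = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return "\n".join(f"{n:7d} {path}" for path, n in rows)


# Invented sample data using documentation-only IP addresses.
sample = [
    '192.0.2.1 - - [01/Jan/2008:10:00:00 +0000] "GET /Health/Beerolies.html HTTP/1.1" 200 512',
    '192.0.2.1 - - [01/Jan/2008:10:00:02 +0000] "POST /Health/Beerolies.html HTTP/1.1" 200 512',
    '192.0.2.7 - - [01/Jan/2008:10:01:00 +0000] "POST /Health/Beerolies.html HTTP/1.1" 200 512',
    '192.0.2.9 - - [01/Jan/2008:10:02:00 +0000] "POST /Random/Birdflu.html HTTP/1.1" 200 512',
]

print(format_breakdown(count_posts_by_path(sample)))
```

GETs are ignored on purpose, since only form submissions are of interest here.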

Why these pages? Not quite sure, but they coincide with the pages that get the most Google hits, so maybe that’s it. I’ve collected data on the form postings, the primary aim being to capture whether or not real people were behind them. I thought this highly unlikely; it looked like bot activity to me. But there is enough “they just get cheap people in China to fill in forms” talk going around that I thought I’d try something: I’ve added javascript to the page that records all keystrokes and mouse activity to a log that is sent to my web server when the form is submitted. This was fun, and neat; for an example have a look at keylog.html.
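
On the server side, deciding whether a submission came with any human-looking activity could go something like the sketch below. Everything about the log format here is an assumption made for illustration: I’m pretending the page submits a JSON list of events, each with a `type` and a timestamp `t`. The format my page actually uses isn’t described in this post.

```python
import json

# Hypothetical event names; a real page may log different ones.
HUMAN_EVENT_TYPES = {"keydown", "keyup", "mousemove", "click"}


def summarise_activity(raw_log):
    """Count logged input events by type; a missing or unparseable log gives {}."""
    if not raw_log:
        return {}
    try:
        events = json.loads(raw_log)
    except json.JSONDecodeError:
        return {}
    if not isinstance(events, list):
        return {}
    summary = {}
    for event in events:
        kind = event.get("type") if isinstance(event, dict) else None
        if kind in HUMAN_EVENT_TYPES:
            summary[kind] = summary.get(kind, 0) + 1
    return summary


def has_human_activity(raw_log):
    """True if at least one keyboard or mouse event was logged."""
    return bool(summarise_activity(raw_log))


print(has_human_activity('[{"type": "keydown", "t": 12}, {"type": "mousemove", "t": 30}]'))
print(has_human_activity(""))
```

An empty log only says that nothing was recorded, not why, so it is a hint rather than proof.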

The end result of this little exercise is that I seem to have confirmed my opinion that there are no “real” people involved here. This isn’t representative of course: my site is tiny, unimportant, and doesn’t employ CAPTCHAs. If anything I’m a very unlikely target of such attention. Further, there are two ways to disable my logging:

  1. Cache the form and present from some “form filling” tool (unlikely).
  2. Have javascript disabled (duh).

I classify the first as highly improbable. I classify the second as not being the case for my forms since I’ve started getting submissions with spam data filled into hidden fields.
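
Spam data turning up in hidden fields is useful in its own right. A common trick (usually called a honeypot field, and not something I’ve described doing above) is to reject any submission where a field that no human can see arrives filled in. A minimal sketch, assuming the form data arrives as a plain dict and using a made-up field name:

```python
# Hypothetical field name: "website_confirm" stands in for whatever hidden
# input the form carries; a person using a browser never sees or fills it.
HONEYPOT_FIELDS = ("website_confirm",)


def looks_like_bot(form):
    """Return True if any hidden honeypot field arrived non-empty."""
    return any(form.get(name, "").strip() for name in HONEYPOT_FIELDS)


human_post = {"name": "Alice", "comment": "Tried the pasta, thanks.", "website_confirm": ""}
bot_post = {"name": "casino", "comment": "Nice site. Thanks!", "website_confirm": "http://spam.example/"}

print(looks_like_bot(human_post))
print(looks_like_bot(bot_post))
```

This only catches bots that blindly fill every field, which seems to describe the ones visiting me.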

It would have been much more interesting to pick up some key logs! But the effort has revealed interesting data regardless.

  1. After changing the form, the new fields didn’t show up in POSTs, so the POSTer (a bot) responsible has cached the form (or the form params at any rate).
  2. There was a delay of only one hour between the form change and the first new spam post with the new fields. Of 1000 POSTs in the next 12 hours only 10 were for the new form. Most current POSTs are still using the old form fields.
  3. Nine of the new-form hits were for the same page (Technology/General/Comments.html), so I take these to be the first hits from a new crawl by the form-snaffling bot.
  4. Just one was for Food/Ristretto/The_Coffee_House_on_Watford_High_Street.html, and this is a very different POST from all the others (spammy random URL and random-letter “words”, while all the others have real “English” word sequences).
  5. Reflecting back on the access logs it looks like POSTs are usually preceded by GETs to the correct URLs and the GET has no referrer (related: in the same period there are 4 hits to the page by MSIE variants with no referrers and no other hits from the same IP, the spider maybe? Two of the IE UA strings are just broken looking.)
  6. The “url” field is always filled in with a “http://…” URL.
  7. Across all 1000+ posts only 33 URLs are used. These are not evenly distributed, with about 5 around the 100 mark, 27 below 25 (10 are single occurrences), and 7 in the 25-80 range.
  8. A total of 148 IPs source the POSTs; many make only 1 or 2 POSTs, 22 make between 10 and 50, 5 between 50 and 100, and one makes 127 POSTs (submitting 15 URLs with very uneven distribution).
  9. Five URLs appear to be typos with “hyml” rather than “html”, but I’m not giving them the satisfaction of a hit to find out for sure. It might just be an obfuscation attempt. Of these possible typos, three are the three most submitted URLs.
  10. 40 “name”s are used and 35 “title”s, these usually are filled in with identical data, and usually related to the obvious subject of the URL.
  11. Most spamvertising is for drug names (I recognise “viagra” but the rest mean little to me: “levitra”, “ambien”, “xanax”, “cialis”), next most popular is gaming/casinos (including the most spammed URL), finally there’s porn (comparatively infrequent).
  12. The “comment” field is usually filled in with some supposedly complimentary text, and only contains URLs in two cases.
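
The per-IP numbers in the list above are just a histogram of “how many POSTs did each source make”. Roughly how one might compute it is sketched below; the bucket boundaries are my own choice for the example (the ranges quoted above overlap at 50, so I had to pick) and the addresses are invented.

```python
from collections import Counter

# Inclusive (low, high) ranges; None means "no upper limit". These exact
# boundaries are an assumption made for this example.
BUCKETS = [(1, 2), (3, 9), (10, 49), (50, 99), (100, None)]


def bucket_label(low, high):
    """Human-readable label for a bucket, e.g. '10-49' or '100+'."""
    return f"{low}+" if high is None else f"{low}-{high}"


def bucket_sources(ips):
    """Count how many source IPs fall into each POSTs-per-IP bucket."""
    per_ip = Counter(ips)
    histogram = Counter()
    for count in per_ip.values():
        for low, high in BUCKETS:
            if count >= low and (high is None or count <= high):
                histogram[bucket_label(low, high)] += 1
                break
    return histogram


# Invented data: one entry per POST, keyed by source address.
posts = ["192.0.2.1"] * 1 + ["192.0.2.2"] * 2 + ["192.0.2.3"] * 12 + ["192.0.2.4"] * 127

print(dict(bucket_sources(posts)))
```

The same `Counter` approach works for the URL, “name” and “title” fields; only the key changes.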

I’ll leave the observations at that. More interesting would be to draw relationships between the content of the different fields, but inspection doesn’t show any obvious patterns and I don’t have time to dig deeper. The frequency of comment content is:

      1 comment:[[URLS REMOVED]]
      1 comment:good post man [[URLS REMOVED]]
      1 comment:so many interesting [[URLS REMOVED]]
      1 comment:yujlh lzqfe heug xsjepcl dljfugw axiwrlbcm visf
      6 comment:Hello, nice site look this:
     44 comment:Good design!
     48 comment:Great work!
     49 comment:Pretty much nothing seems important.
     50 comment:Good site. Thank you.
     50 comment:I like your site very much indeed.
     51 comment:Great site! Beautiful craftsmanship!! Keep of the wonderful work!!
     52 comment:Nice site
     53 comment:Cool site. Thank you!
     53 comment:Hello, very nice site!
     53 comment:TARRIFIC SITE!
     53 comment:Thank you!
     55 comment:Hi, nice site
     56 comment:Well done!
     57 comment:very interesting fix links
     60 comment:Nice site. Thanks.
     61 comment:I feel like a bunch of nothing.
     61 comment:I just don't have anything to say.
     64 comment:Cool site. Thank you:-)
     64 comment:Excellent web site. I will visit it often.
     69 comment:Nice site. Thanks!

We’ve all seen “Nice site. Thanks!” on blogs all over the ‘net. My favourite is “I feel like a bunch of nothing.”, makes me feel sorry for some poor depressed zombie machine somewhere. The fourth one, “yujlh…” is from the only POST that looks completely unlike all the others, a URL submitted but with all other fields meaningless character sequences.

My feeling is that this is the “new spam”, though maybe not so new, just harder to measure. Why try to push to victims through email, which is rapidly losing people’s trust, when you can focus real effort on simultaneously getting the word spread all over the ‘net and pushing search-ranking juice to these pages? Does this really work? Seems unlikely, but I’ve never been able to get my head around the fact that spam is actually effective … it takes all kinds of stupid to make a society.

They say that email spam is declining (but people like to say that every few months, then there’s another surge) so maybe the resources are going into this instead. The next question is the source? I think it is probably clear that this is the work of a bot-net, do we think Storm? Who’s paying them? Maybe the URLs are actually

There’s been 100 new POSTs since I started writing this (one hour ago).

What can we do about this? The solution seems simple. Guard web forms appropriately! CAPTCHAs are popular, but requiring login/registration may be better. Mark all URLs as “nofollow” to kill any hopes of search-ranking inflation (or don’t allow URLs if they can be avoided). The simplicity is probably misleading though: this flood against my little site is unsophisticated, probably because that is all that’s needed to post to so many blog-type sites. If bloggers raise the bar the bot herders will just jump higher. Depressing, isn’t it? The continued lack of any real solutions against malware and spam often makes me “feel like a bunch of nothing”, to quote one of the bots.
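
For the “nofollow” part, here’s a small sketch of what I mean, assuming comments are stored as plain text and turned into HTML at display time. The function name and the URL pattern are just for illustration; the pattern is deliberately crude.

```python
import html
import re

# Crude URL matcher, good enough for a sketch: http(s) up to whitespace,
# angle brackets or a double quote.
URL_RE = re.compile(r'https?://[^\s<>"]+')


def render_comment(text, allow_links=True):
    """Escape a plain-text comment; link URLs with rel="nofollow" or drop them."""
    parts = []
    last = 0
    for match in URL_RE.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        if allow_links:
            url = html.escape(match.group(0), quote=True)
            parts.append(f'<a href="{url}" rel="nofollow">{url}</a>')
        else:
            parts.append("[URL removed]")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(render_comment("Nice site http://example.com/x"))
print(render_comment("Nice site http://example.com/x", allow_links=False))
```

Escaping everything else means a comment can’t smuggle in its own `<a>` tag without the `rel` attribute.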

Leftovers, some more stats:

User agents:

      1 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
      1 HTTP_USER_AGENT:Xrqhgdfzi sipmvr zqboirha
      3 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
      6 HTTP_USER_AGENT:User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2)
     54 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; http://www.tropicdesigns.net)
     63 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
     75 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
    105 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
    116 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Maxthon)
    129 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1)
    147 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt; MRA 4.0 (build 00768))
    157 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)
    327 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)
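
A few of these strings are visibly odd, and the oddities are easy to flag mechanically. The checks below are my own rough heuristics, not any kind of standard test, and a flagged string is a curiosity rather than proof of a bot.

```python
def ua_problems(user_agent):
    """Return a list of oddities spotted in a User-Agent value (heuristics only)."""
    problems = []
    if user_agent.lower().startswith("user-agent:"):
        problems.append("header name repeated inside value")
    if user_agent.count("Mozilla/") > 1:
        problems.append("nested Mozilla token")
    if "Mozilla/" not in user_agent:
        problems.append("no Mozilla token")
    return problems


examples = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
    "Xrqhgdfzi sipmvr zqboirha",
    "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; "
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; "
    ".NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2)",
]

for ua in examples:
    print(ua_problems(ua), ua[:40])
```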

Intriguing list of vias and proxies:

      1 HTTP_VIA:1.1 MCRDSC, 1.0 kiwi.khi.wol.net.pk:3128 (squid/2.5.STABLE7)
      1 HTTP_VIA:1.1 TTCache03 (Jaguar/3.0-59)
      1 HTTP_VIA:1.1 barracuda.lcps.k12.nm.us:8080 (http_scan/
      1 HTTP_VIA:1.1 fiorillinet:3128 (squid/2.6.STABLE7)
      1 HTTP_VIA:1.1 firewall.seduc.ro.gov.br:3128 (squid/2.5.STABLE6)
      1 HTTP_VIA:1.1 localhost.localdomain
      1 HTTP_VIA:1.1 localhost:3128 (squid/2.5.STABLE14)
      1 HTTP_VIA:1.1 mirage.certelnet.com.br:3128 (squid/2.5.STABLE14)
      1 HTTP_VIA:1.1 none:8080 (Topproxy-2.0/)
      1 HTTP_VIA:1.1 proxy:3128 (squid/2.5.STABLE11)
      1 HTTP_VIA:1.1 sexto.fmetsia.upm.es
      2 HTTP_PROXY_AGENT:Sun-Java-System-Web-Proxy-Server/4.0.3
      2 HTTP_VIA:1.0 allserver.all-milwaukee.org:3128 (squid/2.6.STABLE16)
      2 HTTP_VIA:1.1 i187340:3128 (KEN!)
      2 HTTP_VIA:1.1 proxy-server1
      2 HTTP_VIA:1.1 server2.buffalowelding.com:3120 (squid/2.5.STABLE13)
      4 HTTP_VIA:1.1 FLASH:3128 (squid/2.6.STABLE16-20071117)
      4 HTTP_VIA:1.1 MCRDSC, 1.0 cherry.khi.wol.net.pk:3128 (squid/2.5.STABLE7)
      5 HTTP_VIA:1.1 MCRDSC, 1.0 mango.khi.wol.net.pk:3128 (squid/2.5.STABLE7)
      5 HTTP_VIA:1.1 MCRDSC, 1.0 pear.khi.wol.net.pk:3128 (squid/2.5.STABLE7)
      6 HTTP_VIA:1.0 HAVP
      7 HTTP_VIA:1.1 ISAFW
      8 HTTP_VIA:1.1 ppr-cache1 (NetCache NetApp/6.1.1D2)
     12 HTTP_VIA:1.1 FGMAIN2
     14 HTTP_VIA:1.1 ndb-bau02:3128 (KEN!)
     21 HTTP_VIA:1.1 proxy.net:3128 (squid/2.6.STABLE13)
     24 HTTP_VIA:1.1 microcon-serv3:3128 (KEN!)
     39 HTTP_VIA:1.1 PRINTER
     65 HTTP_VIA:1.1 admin:3128 (squid/2.6.STABLE9)
     97 HTTP_VIA:1.1 gtw1.ciberpoint.com.br:3128 (squid/2.6.STABLE13)

(Interesting to note that some companies here are effectively giving out details about how their internal web clients are scanned at the gateway. Some of this could be enough to expose the existence of vulnerable infrastructure software or help whittle down the list of software you need to check your targeted malware with. Not good practice.)
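
To see just how much a Via header gives away, it can be split into its parts: protocol version, the “received by” host (or pseudonym), and an optional comment in parentheses, which is where the software name and version tend to leak. This is a simplified sketch: it splits on commas naively, so a comment containing a comma would confuse it, and malformed entries are silently skipped.

```python
import re

# One Via entry: "<protocol> <received-by> [(<comment>)]".
VIA_ENTRY_RE = re.compile(
    r"^(?P<protocol>\S+)\s+(?P<received_by>[^\s(]+)(?:\s+\((?P<comment>.*)\))?$"
)


def parse_via(value):
    """Split a Via header value into a list of dicts, one per proxy hop."""
    entries = []
    for raw in value.split(","):
        match = VIA_ENTRY_RE.match(raw.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def disclosed_software(value):
    """Return the comment strings, i.e. whatever software the proxies announce."""
    return [e["comment"] for e in parse_via(value) if e["comment"]]


header = "1.1 MCRDSC, 1.0 kiwi.khi.wol.net.pk:3128 (squid/2.5.STABLE7)"
print(parse_via(header))
print(disclosed_software(header))
```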

I’m at risk!

Note: This entry has been restored from old archives.

According to SiteAdvisor I’m at risk! Oh no!

Rating: You’re at risk!
Watch out! Your inbox might explode! Your decisions would have resulted in your
inbox being filled with approximately 2000 spammy e-mails per week. But who can
blame you? It’s often very hard to tell which sites will respect your personal
information. …

Okay, so not quite so terrible. There’s just one flaw with this quiz and that’s the fact that I would never, ever give any email address to any of the sites and thus am actually at zero risk 😉

I believe the most important point here is missed or at least not well conveyed: that based purely on visuals (they also link to the privacy policies, but who reads them?) you simply can’t tell whether or not a site is going to be a source of spam. That’s why something like SiteAdvisor has value, you just can’t know how bad a site will be until you try and the premise of SiteAdvisor is that they do the trying for you. A very good tool for those who run around the web throwing their email address around like a popular STD (probably most people)… though I have to wonder who’s going to convince people who have bad habits to start with to download an Internet condom? Anyone “in the know” should see it as their duty to spread the word: condoms are good, they’ll protect your box from strange gunk. Though this is more like some kind of registered paedophile list than a general purpose preventative.

I do realise the whole thing is probably an engineered marketing campaign with sites carefully selected for their lack of intuitive ‘spamminess’ clues and that they can probably typically expect a result of 50% (choose the sample comparisons well enough of course and you can swing this either way). The main point is probably avoided since it is more effective to make individuals feel that their personal inadequacies require patching up (taking a lesson from the spammers, this is why penile enlargement spam is still worthwhile enough to continue to be such a problem after all these years, there’s an inexhaustible supply of personal inadequacy out there fuelling the world of misplaced hope otherwise known as advertising^Wspam).

What will protect us from the unexpected though, such as my recent AllOfMP3/ChronoPay experience? Both legitimate online businesses with apparently clean privacy records (okay, so one of them loses points for being Russian) and not a peep of spam after more than a year of use, and then wham, I have more bestiality and incest in my inbox than I can handle. Probably a security breach, either technical or, most likely, a low-paid employee lured by some extra cash. Importantly: this cannot be detected in advance. So while SiteAdvisor is likely an effective approach to mitigate the spam deluge, we’re not quite seeing the end of reactive AntiSpam software just yet, as much as I wish it could be so. I’ve used SiteAdvisor on one of my machines for a while though and do find the results interesting, if not typically much use to a user like myself (the SiteAdvisor Firefox plugin’s marking of Google results as good|bad is nifty; interesting that Google came out not long afterwards with the same idea built-in; SiteAdvisor is still at the advantage because it is there in your taskbar all the time).

If you’re like me you own your own domains, and if forced to give some site an address they get their very own unique one. This has two great advantages: 1) You can block that address when it starts getting spam; 2) You know who was responsible for spamming you or leaking your address. I must admit that it would be nice not to have such a level of complexity required to “manage spam”.
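
Generating the per-site address can be automated. The sketch below is one possible scheme, not the one I actually use: it tacks a short keyed hash onto the site name so that nobody who sees one address can guess the others. The domain, the secret and the function name are all placeholders.

```python
import hashlib
import hmac


def address_for_site(site, secret, domain="example.org"):
    """Derive a stable, hard-to-guess address for one site (illustrative scheme)."""
    normalised = site.lower()
    # Keep only letters and digits for the readable part of the address.
    label = "".join(ch for ch in normalised if ch.isalnum())
    # Short keyed hash so the address can't be guessed from the site name alone.
    tag = hmac.new(secret, normalised.encode("utf-8"), hashlib.sha256).hexdigest()[:8]
    return f"{label}-{tag}@{domain}"


secret = b"change-me"  # placeholder secret
print(address_for_site("AllOfMP3.com", secret))
```

Because the result is deterministic, there is no table to maintain: when spam arrives, the label says which site the address was given to.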

And on a related note I’m sad to see that one of the two remaining spam blocklists that I consider safe to use at an SMTP rejection level looks like it could end up being the victim of further proof of USAian litigative idiocy. The two I still use are: list.dsbl.org, sbl.spamhaus.org.
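
For anyone who hasn’t looked at how these lists are queried: a DNS blocklist lookup is just a DNS query for the client’s IPv4 address with its octets reversed, under the list’s zone. A rough sketch, with the simplifying assumption that any answer at all means “listed” (real lists attach meaning to the returned address, which this ignores, and a DNS failure is indistinguishable from “not listed” here):

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: IPv4 octets reversed, then the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the zone resolves a name for this address (simplified)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False


# Only builds the name; no network traffic happens on this line.
print(dnsbl_query_name("192.0.2.99", "sbl.spamhaus.org"))
```

An MTA does the same thing at SMTP time and rejects the connection on a hit.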