Bot or Not?

Note: This entry has been restored from old archives.

Stick a form on the web and within a few days it’s getting hammered. My recently added “comment” form started getting a few posts, all to the “Comments” entry (web-search anyone?). So I added email registration… could have saved myself time by just not adding the silly “comment” ability in the first place. The registration was just in time: a day later the flood started. Maybe not much of a flood by big-web standards (it must be scary to be a popular website!), but in the last 12 hours I’ve had just over 1000 POSTs of the comment form.

They’re not hitting the “Comments” entry much anymore either. The breakdown is:

      1 Food/Ristretto/The_Coffee_House_on_Watford_High_Street.html
      9 Technology/General/Comments.html
     46 Food/Cooking/Spinach_Pasta.html
     47 Random/Riverbank_Teahouse.html
     48 Food/Eating/England/London/Gourmet_Burger_Kitchen.html
     49 Technology/Code/awk_awk_awk_.html
     52 Technology/Code/Just_Like_Uni.html
     53 Technology/General/Flashy_Shite.html
     54 Food/Cooking/Lime_Poached_Chicken.html
     55 Food/Eating/England/London/The_Neal_Street_Restaurant.html
     59 Technology/General/EZRSSFeeds_and_other_WebSuckers.html
     59 Technology/General/Still_Doesn_t_Like_Kaspersky.html
     67 Random/Flip_Out_Like_A_Ninja.html
     69 Random/BAA_BAA_Whisky.html
     70 Random/Birdflu.html
     83 Health/Your_Back_Needs_Debugging.html
     84 Food/Ristretto/Caffe_Vergnano_1882_on_Charing_Cross_Road.html
     87 Health/Beerolies.html
     98 Technology/General/Collateral_Damage__An_Unintentional_Storm_Worm_DOS.html
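
A minimal sketch of how a per-page breakdown like the one above could be pulled from an access log. It assumes Apache-style "combined" log lines and that the comment form POSTs to the page's own URL; neither is stated in this post, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumption: NCSA/Apache "combined" log lines, where the request appears
# as "METHOD /path HTTP/x.y" inside double quotes.
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')


def count_posts_per_page(lines):
    """Count POST requests per page path, ignoring any query string."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group("method") == "POST":
            path = match.group("path").split("?", 1)[0].lstrip("/")
            counts[path] += 1
    return counts


def format_tally(counts):
    """Render counts in ascending order, in the style of `sort | uniq -c | sort -n`."""
    rows = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return "\n".join(f"{n:7d} {key}" for key, n in rows)

# Usage (hypothetical file name):
#   with open("access.log") as fh:
#       print(format_tally(count_posts_per_page(fh)))
```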

Why these pages? Not quite sure, but they coincide with the pages that get the most Google hits, so maybe that’s it. I’ve collected data on the form postings, the primary aim being to find out whether or not real people were behind them. I thought this highly unlikely; it looked like bot activity to me. But there is enough “they just get cheap people in China to fill in forms” going around that I thought I’d try something: I’ve added javascript to the page that records all keystrokes and mouse activity to a log, which is sent to my web server when the form is submitted. This was fun, and neat; for an example have a look at keylog.html.

The end result of this little exercise is that I seem to have confirmed my opinion that there are no “real” people involved here. This isn’t representative of course, my site is tiny, unimportant, and doesn’t employ CAPTCHAs. If anything I’m a very unlikely target of such attention. Further, there are two ways to disable my logging:

  1. Cache the form and present from some “form filling” tool (unlikely).
  2. Have javascript disabled (duh).

I classify the first as highly improbable. I classify the second as not being the case for my forms since I’ve started getting submissions with spam data filled into hidden fields.
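
The two signals discussed here (hidden fields arriving filled in, and submissions arriving without any keystroke/mouse log) could be checked server-side along these lines. This is only a sketch: the field names `hp_website` and `activity_log` are hypothetical, and it assumes the page itself never puts a value into the hidden fields, so anything found there came from automation.

```python
def automation_signals(form, hidden_fields=("hp_website",), log_field="activity_log"):
    """Return a list of reasons a submission looks automated (empty if none).

    `form` is a plain dict of submitted field names to string values.
    The field names are placeholders, not the ones used on this site.
    """
    signals = []
    # A person never sees hidden fields, so a non-empty value suggests a bot
    # that parsed the form and filled in everything it found.
    for name in hidden_fields:
        if form.get(name, "").strip():
            signals.append(f"hidden field filled: {name}")
    # The logging javascript attaches its log on submit; no log means the
    # script never ran (javascript off, or the form was replayed from a cache).
    if not form.get(log_field, "").strip():
        signals.append("no interaction log")
    return signals
```

An empty list does not prove a human was involved; it only means neither of these two cheap checks fired.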

It would have been much more interesting to pick up some key logs! But the effort has revealed interesting data regardless.

  1. After changing the form the new fields didn’t show up in POSTs, so the POSTer responsible (a bot) has cached the form (or the form params at any rate).
  2. There was a delay of only one hour between the form change and the first new spam post with the new fields. Of 1000 POSTs in the next 12 hours only 10 were for the new form. Most current POSTs are still using the old form fields.
  3. Nine of the new-form hits were for the same page (Technology/General/Comments.html), so first hit from a new crawl of the form-snaffling bot I take it.
  4. Just one was for Food/Ristretto/The_Coffee_House_on_Watford_High_Street.html, and this is a very different POST from all the others (spammy random URL and random-letter “words”, while all the others have real “English” word sequences).
  5. Reflecting back on the access logs it looks like POSTs are usually preceded by GETs to the correct URLs and the GET has no referrer (related: in the same period there are 4 hits to the page by MSIE variants with no referrers and no other hits from the same IP, the spider maybe? Two of the IE UA strings are just broken looking.)
  6. The “url” field is always filled in with a “http://…” URL.
  7. Across all 1000+ posts only 33 URLs are used. These are not evenly distributed, with about 5 around the 100 mark, 27 below 25 (10 are single occurrences), and 7 in the 25-80 range.
  8. A total of 148 IPs source the POSTs. Many make only 1 or 2 POSTs, 22 make between 10 and 50, 5 between 50 and 100, and one makes 127 POSTs (submitting 15 URLs with very uneven distribution).
  9. Five URLs appear to be typos with “hyml” rather than “html”, but I’m not giving them the satisfaction of a hit to find out for sure. It might just be an obfuscation attempt. Of these possible typos three are the three most submitted URLs.
  10. 40 “name”s are used and 35 “title”s, these usually are filled in with identical data, and usually related to the obvious subject of the URL.
  11. Most spamvertising is for drug names (I recognise “viagra” but the rest mean little to me: “levitra”, “ambien”, “xanax”, “cialis”), next most popular is gaming/casinos (including the most spammed URL), finally there’s porn (comparatively infrequent).
  12. The “comment” field is usually filled in with some supposedly complimentary text, and only contains URLs in two cases.
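
The old-form versus new-form split in observations 1 to 3 could be measured with something like the following sketch. The post does not name the fields added in the form change, so `hp_website` is a stand-in, and each POST is assumed to be available as a dict of field names to values.

```python
from collections import Counter

# Hypothetical: the set of field names that only exist in the changed form.
NEW_FIELDS = frozenset({"hp_website"})


def form_version(field_names, new_fields=NEW_FIELDS):
    """Label a POST 'new' if it carries every new field, otherwise 'old'."""
    return "new" if new_fields <= set(field_names) else "old"


def tally_versions(posts):
    """Count how many POSTs were built from the old and the new form."""
    return Counter(form_version(post.keys()) for post in posts)
```

A POST labelled "old" after the change date is one built from a cached copy of the form, which is the behaviour the first observation describes.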

I’ll leave the observations at that. More interesting would be to draw relationships between the contents of the different fields, but inspection doesn’t show any obvious patterns and I don’t have time to dig deeper. The frequency of comment content is:

      1 comment:[[URLS REMOVED]]
      1 comment:good post man [[URLS REMOVED]]
      1 comment:so many interesting [[URLS REMOVED]]
      1 comment:yujlh lzqfe heug xsjepcl dljfugw axiwrlbcm visf
      6 comment:Hello, nice site look this:
     44 comment:Good design!
     48 comment:Great work!
     49 comment:Pretty much nothing seems important.
     50 comment:Good site. Thank you.
     50 comment:I like your site very much indeed.
     51 comment:Great site! Beautiful craftsmanship!! Keep of the wonderful work!!
     52 comment:Nice site
     53 comment:Cool site. Thank you!
     53 comment:Hello, very nice site!
     53 comment:TARRIFIC SITE!
     53 comment:Thank you!
     55 comment:Hi, nice site
     56 comment:Well done!
     57 comment:very interesting fix links
     60 comment:Nice site. Thanks.
     61 comment:I feel like a bunch of nothing.
     61 comment:I just don't have anything to say.
     64 comment:Cool site. Thank you:-)
     64 comment:Excellent web site. I will visit it often.
     69 comment:Nice site. Thanks!
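
For anyone wanting to dig into relationships between fields, a cross-tabulation is one cheap starting point: for each value of one field, count which values of another field it was submitted with. A rough sketch, assuming each POST is available as a dict; nothing like this was actually run on the data here.

```python
from collections import Counter, defaultdict


def cross_tab(posts, field_a, field_b):
    """Map each value of field_a to a Counter of the field_b values seen with it."""
    table = defaultdict(Counter)
    for post in posts:
        table[post.get(field_a, "")][post.get(field_b, "")] += 1
    return table


def exclusive_values(table):
    """Return the field_a values that only ever appear with one field_b value."""
    return sorted(a for a, seen in table.items() if len(seen) == 1)
```

If, say, a given URL only ever arrived with a single comment string, that would hint at per-URL templates rather than text picked at random from a shared pool.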

We’ve all seen “Nice site. Thanks!” on blogs all over the ‘net. My favourite is “I feel like a bunch of nothing.”, makes me feel sorry for some poor depressed zombie machine somewhere. The fourth one, “yujlh…” is from the only POST that looks completely unlike all the others, a URL submitted but with all other fields meaningless character sequences.

My feeling is that this is the “new spam”, though maybe not so new, just harder to measure. Why try to push to victims through email, which is rapidly losing people’s trust, when you can focus real effort on simultaneously getting the word spread all over the ‘net and pushing search-ranking juice to these pages? Does this really work? Seems unlikely, but I’ve never been able to get my head around the fact that spam is actually effective … it takes all kinds of stupid to make a society.

They say that email spam is declining (but people like to say that every few months, then there’s another surge) so maybe the resources are going into this instead. The next question is the source. I think it is fairly clear that this is the work of a bot-net; do we think Storm? Who’s paying them? Maybe the URLs are actually

There have been 100 new POSTs since I started writing this (one hour ago).

What can we do about this? The solution seems simple. Guard web forms appropriately! CAPTCHAs are popular, but requiring login/registration may be better. Mark all URLs as “nofollow” to kill any hopes of search-ranking inflation (or don’t allow URLs if they can be avoided). The simplicity is probably misleading though: this flood against my little site is unsophisticated, probably because that is all that’s needed to post to so many blog-type sites. If bloggers raise the bar the bot herders will just jump higher. Depressing isn’t it? The continued lack of any real solutions against malware and spam often makes me “feel like a bunch of nothing”, to quote one of the bots.
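
As an illustration of the “nofollow” suggestion, here is a small sketch of rendering a commenter-supplied URL so that it carries `rel="nofollow"`, and falls back to plain text when the URL isn’t a plain http(s) link. The function name is made up, and this is not how comments are rendered on this site.

```python
from html import escape
from urllib.parse import urlsplit


def render_commenter_link(url, label):
    """Return an HTML link marked rel="nofollow", or just the escaped label."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. a broken IPv6 host.
        return escape(label)
    # Only link out for ordinary web URLs; this also drops javascript: URLs.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return escape(label)
    return f'<a href="{escape(url, quote=True)}" rel="nofollow">{escape(label)}</a>'
```

Search engines that honour `nofollow` give the target no ranking credit for the link, which removes the incentive but of course not the POSTs themselves.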

Leftovers, some more stats:

User agents:

      1 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
      1 HTTP_USER_AGENT:Xrqhgdfzi sipmvr zqboirha
      3 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
      6 HTTP_USER_AGENT:User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2)
     54 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
     63 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
     75 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
    105 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
    116 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Maxthon)
    129 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1)
    147 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt; MRA 4.0 (build 00768))
    157 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)
    327 HTTP_USER_AGENT:Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)
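
A few of the strings above are visibly malformed: one has the header name pasted into the value, one is random letters, one is cut off mid-parenthesis. A rough sketch of flagging that sort of thing automatically follows; the three checks are ad hoc guesses based on this list, not an established detection method.

```python
import re

# A User-Agent value normally starts with a product token such as "Mozilla/4.0".
PRODUCT_RE = re.compile(r"[A-Za-z0-9_.-]+/\S+")


def ua_oddities(ua):
    """Return a list of things that look wrong with a User-Agent value."""
    flags = []
    if ua.lower().startswith("user-agent:"):
        flags.append("header name inside value")
    if not PRODUCT_RE.match(ua):
        flags.append("no leading product/version token")
    if ua.count("(") != ua.count(")"):
        flags.append("unbalanced parentheses")
    return flags
```

Well-formed strings prove nothing, of course: most of the POSTs here came with perfectly plausible MSIE user agents.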

Intriguing list of vias and proxies:

      1 HTTP_VIA:1.1 MCRDSC, 1.0 (squid/2.5.STABLE7)
      1 HTTP_VIA:1.1 TTCache03 (Jaguar/3.0-59)
      1 HTTP_VIA:1.1 (http_scan/
      1 HTTP_VIA:1.1 fiorillinet:3128 (squid/2.6.STABLE7)
      1 HTTP_VIA:1.1 (squid/2.5.STABLE6)
      1 HTTP_VIA:1.1 localhost.localdomain
      1 HTTP_VIA:1.1 localhost:3128 (squid/2.5.STABLE14)
      1 HTTP_VIA:1.1 (squid/2.5.STABLE14)
      1 HTTP_VIA:1.1 none:8080 (Topproxy-2.0/)
      1 HTTP_VIA:1.1 proxy:3128 (squid/2.5.STABLE11)
      1 HTTP_VIA:1.1
      2 HTTP_PROXY_AGENT:Sun-Java-System-Web-Proxy-Server/4.0.3
      2 HTTP_VIA:1.0 (squid/2.6.STABLE16)
      2 HTTP_VIA:1.1 i187340:3128 (KEN!)
      2 HTTP_VIA:1.1 proxy-server1
      2 HTTP_VIA:1.1 (squid/2.5.STABLE13)
      4 HTTP_VIA:1.1 FLASH:3128 (squid/2.6.STABLE16-20071117)
      4 HTTP_VIA:1.1 MCRDSC, 1.0 (squid/2.5.STABLE7)
      5 HTTP_VIA:1.1 MCRDSC, 1.0 (squid/2.5.STABLE7)
      5 HTTP_VIA:1.1 MCRDSC, 1.0 (squid/2.5.STABLE7)
      6 HTTP_VIA:1.0 HAVP
      7 HTTP_VIA:1.1 ISAFW
      8 HTTP_VIA:1.1 ppr-cache1 (NetCache NetApp/6.1.1D2)
     12 HTTP_VIA:1.1 FGMAIN2
     14 HTTP_VIA:1.1 ndb-bau02:3128 (KEN!)
     21 HTTP_VIA:1.1 (squid/2.6.STABLE13)
     24 HTTP_VIA:1.1 microcon-serv3:3128 (KEN!)
     39 HTTP_VIA:1.1 PRINTER
     65 HTTP_VIA:1.1 admin:3128 (squid/2.6.STABLE9)
     97 HTTP_VIA:1.1 (squid/2.6.STABLE13)

(Interesting to note that some companies here are effectively giving out details about how their internal web clients are scanned at the gateway. Some of this could be enough to expose the existence of vulnerable infrastructure software or help whittle down the list of software you need to check your targeted malware with. Not good practice.)