Wednesday, February 13, 2008

Access forbidden. Attempted work detected.

The forbidden-content blocker at my worksite is more stupid than the one at your worksite. Doubt me? "Computing/Internet" is one of its forbidden categories. Blogs, another forbidden category, are the biggest problem because, as you know if you're reading this, most of the good code samples in existence these days are on blogs. We aren't allowed to appeal categories, only individual sites.

When I whined about it to my boss, he reminded me that I've done some work with web-scraping, and that I could probably write a Python script to make an exemption request for thousands of blogs at once, including ones I've never seen but might Google up someday. It'll benefit thousands of geeks across the organization, and maybe even convince somebody that they should turn the "stupid" dial down a notch.

I love my boss.

Here's the plan:
1. Build a whitelist and a blacklist of known legit and non-legit sites, respectively.
2. Count words. Words that appear much more frequently on whitelisted sites than on blacklisted ones are positive scorers, and vice versa.
3. Apply the scoring to unknown sites. Hand-check a lot of results. Beef up the whitelists and blacklists and repeat until the scores against new sites are reliable.
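
To make steps 2 and 3 concrete, here's a rough sketch of the scoring, not the finished script. It assumes the page text has already been fetched into plain strings. The function names (`tokenize`, `word_scores`, `score_page`) and the add-one smoothed log-ratio are just choices made for this illustration; any measure of "much more frequent on one list than the other" would fit the plan.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_scores(white_pages, black_pages):
    """Score each word by how much more often it shows up on whitelisted pages.

    Positive scores lean whitelist, negative scores lean blacklist.
    Add-one smoothing keeps words seen on only one list from dividing by zero.
    """
    white = Counter(w for page in white_pages for w in tokenize(page))
    black = Counter(w for page in black_pages for w in tokenize(page))
    vocab = set(white) | set(black)
    white_total = sum(white.values()) + len(vocab)
    black_total = sum(black.values()) + len(vocab)
    return {
        w: math.log((white[w] + 1) / white_total)
        - math.log((black[w] + 1) / black_total)
        for w in vocab
    }


def score_page(text, scores):
    """Average the word scores over a page; words never seen before count as 0."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(scores.get(w, 0.0) for w in words) / len(words)
```

Call `word_scores` with the text of the whitelisted and blacklisted pages, then run `score_page` on each unknown site. Where to put the cutoff between "legit" and "not" is exactly what the hand-checking in step 3 is for.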

Oddly enough, the hardest part of the process is gathering a long enough blacklist. I'm not just talking porn - I need harmless but non-work-related stuff, blogs where people are writing about their Irish setter instead of their SQL queries. (I need bunches of them - dozens, at least - so don't just comment to tell me, "My blog is useless and banal!")

One of the best sources I've found so far has been googling for "typical stupid blog".

4 comments:

Anonymous said...

Evelyn wants to remind you to unblock http://icanhascheezburger.com/

Sean

Unknown said...

Of course. It's a key source of documentation for LOLCode.

Anonymous said...

[sarcasm]Just blacklist on the keyword "microsoft".[/sarcasm]

I did a similar project for my AI course a few years ago. I ended up having a similar problem: I just didn't visit a large enough number of sites to get good data.

Ken Kuhlman said...

Won't something simple like searching del.icio.us for blog & "insert banal topic here" work? A few topics off the top of my head would be sports, stocks, vacation, travel, pets...

You should be able to get thousands of blacklist sites pretty easily.