The Deep Web
If you think that the Google, Yahoo!, and MSN search engines scour an already impressive expanse of the World Wide Web, think again.
In one of my favorite Communications of the ACM articles ever, Bin He, Mitesh Patel, Zhen Zhang, and Kevin Chen-Chuan Chang shed some light on the “Deep Web”, the huge chunk of cyberspace that isn’t reached by search engines. This chunk consists of data in massive online databases, as opposed to static HTML pages that are easily crawled by search engine spiders.
The CACM article (titled Accessing the Deep Web) cites figures from a BrightPlanet whitepaper, which says, “The deep Web contains 7,500 terabytes of information compared to nineteen terabytes of information in the surface Web.” Whoa. We are missing out on a lot, then.
While the BrightPlanet whitepaper gave a good overview and introduction to the Deep Web, Bin He et al.’s article studied a million IP addresses to further quantify Deep Web data. It’s unfortunate that I can’t quote a huge chunk of the CACM article (digital reprinting requires permission/fee), but let me cite below some figures regarding the Deep Web coverage of the three big search engines. (This is the part that I really wanted to post about.)
Google: 32% of the entire Deep Web
Yahoo: 32%
MSN: 11%
Total: 37%
The total is much lower than the sum of the three because Google’s and Yahoo’s coverages overlap substantially, and MSN’s coverage is a full subset of Yahoo’s. (What a loser. No wonder my rare MSN searches almost always turn out inferior to the other two engines’.)
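For the curious, the figures above pin down how large the Google/Yahoo overlap must be. Here is a minimal back-of-the-envelope sketch using inclusion-exclusion. It assumes, as stated above, that MSN sits entirely inside Yahoo's coverage, so the 37% total is just Google and Yahoo combined. The variable names are mine, not the article's.

```python
# Coverage figures quoted above, as percentages of the entire Deep Web.
google = 32
yahoo = 32
msn = 11
total = 37

# Assumption (from the post): MSN's coverage is a full subset of Yahoo's,
# so MSN adds nothing new and the total equals the union of Google and Yahoo.
# Inclusion-exclusion: |G u Y| = |G| + |Y| - |G n Y|
overlap = google + yahoo - total

# What each engine covers that the other does not.
google_only = google - overlap
yahoo_only = yahoo - overlap

print(f"Covered by both Google and Yahoo: {overlap}%")
print(f"Covered by Google only: {google_only}%")
print(f"Covered by Yahoo only: {yahoo_only}%")
print(f"Share of Google's coverage that Yahoo also has: {overlap / google:.0%}")
```

Under that assumption, most of what one engine reaches the other reaches too, which is why adding a second or third engine barely moves the total.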
In its conclusion, the article states that the traditional crawl-and-index technique might not work for online databases, and a new approach is needed. Needless to say, the curiosity of my inner Net geek (a separate entity from the lit geek, miniature model geek, and so on…) is very much aroused by the immensity of information just waiting to be tapped on the Web.

July 25th, 2007 at 12:11 am
Yeah, and of course there’s the problem of interoperability among these data sets. It would be so much easier if crucial information were shared among different websites, and so on.
My cousin asked me last Sunday if it was possible to create a medical tourism website that let you make reservations at a local hospital and pay online, along with related features. She said someone had already told her that the establishments had different ways of formatting and accessing their data, so it would be close to impossible to accomplish. I said the government had to get into this effort. Sigh. It’s a great idea, but right now… it’d take ages to see it through.
July 25th, 2007 at 10:30 pm
@ia: about the tourism site, imo it just needs one site to be successful and set the standard. the rest will follow suit because there’s money involved hehe.
August 4th, 2007 at 5:39 am
Ia, unfortunately, different websites act as different companies (well, most of the time they are different companies), and we know that different companies seldom share info amongst themselves.
Garro, amen to that. Money would be the catalyst. But that one-site-to-be-successful would be a huge undertaking.