
What do we really see? - The deep and the surface web

Tuesday, January 10th, 2006

The crawlers of Google, Yahoo, MSN and other search engine providers automatically index the web. All of the web? No, not all of it, just the surface, which consists of billions of documents such as HTML pages or directly linked files of any kind (e.g. mp3, PDF, doc, zip…). With complex search strings we can also plunge a little into the grey matter below the surface web (SW): many documents are not directly linked but are still indexed.

The deep web (DW) is web data that resides in databases and is only available dynamically in response to queries (e.g. when you run a search on a specific website, log in, or load a page that is generated on request). It is supposedly much bigger than the surface web and provides more valuable data.
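To make the distinction concrete, here is a minimal sketch of a link-following crawler run against a toy site. Everything in it is an assumption made for illustration: the page paths, the `STATIC_PAGES` and `DATABASE` tables and the `crawl` function are hypothetical, and real search engine crawlers are far more elaborate. It only shows why content that appears solely in response to a query never turns up in a crawl that follows links.

```python
from collections import deque

# Hypothetical toy site: each static page and the links it contains.
STATIC_PAGES = {
    "/": ["/about", "/files/report.pdf", "/search"],
    "/about": ["/"],
    "/files/report.pdf": [],
    "/search": [],  # a search form: it links to nothing until a query is submitted
}

# Hypothetical database records, reachable only as the result of a query.
DATABASE = {
    "patents": "/search?q=patents",
    "flights": "/search?q=flights",
}


def crawl(start):
    """Breadth-first crawl that visits only pages reachable via static links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in STATIC_PAGES.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


surface = crawl("/")                     # what the crawler can index
deep = set(DATABASE.values()) - surface  # what stays hidden behind the form

print(sorted(surface))
print(sorted(deep))
```

In this toy setup the crawler reaches the search form itself but none of the result pages, because no static link points at them.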

Bergman (2001) estimates that the DW contains 7,500 terabytes of data, compared to 19 terabytes in the SW. Current link analysis and crawler activity do not help to tap into those sources. Indexing them is much more complex and labor-intensive, and would probably exceed the storage capabilities currently available to Google. A market exists, as organisations such as the CIA, the FBI or private companies have an interest in using those additional (high-quality) resources.
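As a quick back-of-the-envelope check of what those two figures imply, the short sketch below computes the size ratio and the share of the total that the surface web would represent. The variable names are my own, and the only inputs are the two numbers quoted above.

```python
# Figures quoted above from Bergman (2001), in terabytes.
deep_web_tb = 7_500
surface_web_tb = 19

# How many times larger the deep web is than the surface web.
ratio = deep_web_tb / surface_web_tb

# Fraction of all web data (deep + surface) that sits on the surface.
surface_share = surface_web_tb / (deep_web_tb + surface_web_tb)

print(f"Deep web is about {ratio:.0f} times the size of the surface web")
print(f"Surface web share of the total: {surface_share:.2%}")
```

Taken at face value, the estimate puts the deep web at roughly 395 times the size of the surface web, so the crawlers would be seeing only about a quarter of one percent of the data.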

Related links:
Search engine trends, marketing
The Deep Web: Surfacing Hidden Value by Bergman (2001, U Michigan)
Search for the invisible web by Sherman (2001, The Guardian)
Index Structure for Querying the Deep Web by Qiu/Shao/Zatsman/Shanmugasundaram (2003, Cornell U)
Accessing the Deep Web: A Survey by He/Patel/Zhang/Chang (2004, U Illinois at Urbana-Champaign)
Deep Web Search engine
Introduction to the deep web by Laura Cohen (2005, SUNY Albany)
Finding unpublished research by Mathews (2004, ACRL)