What do we really see? - The deep and the surface web

Tuesday, January 10th, 2006

The crawlers of Google, Yahoo, MSN and other search engine providers automatically index the web. All of the web? No, not all of it, just the surface, which consists of billions of documents such as HTML pages or directly linked files of any kind (e.g. mp3, PDF, doc, zip…). With complex search strings we can also plunge a little into the grey matter below the surface web (SW): many documents are not directly linked but are still indexed.

The deep web (DW) is web data that resides in databases and is only available dynamically, in response to queries (e.g. when you run a search on a specific website, log in or load a page). It is supposedly much bigger than the surface web and provides more valuable data.
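To make the distinction concrete, here is a minimal sketch of why a link-following crawler misses query-only content. The site, its pages and the `search` function are all hypothetical, invented for illustration; real crawlers are far more elaborate.

```python
from collections import deque

# Hypothetical toy site: static pages and the links each one contains.
STATIC_LINKS = {
    "/index.html": ["/about.html", "/files/report.pdf"],
    "/about.html": ["/index.html"],
    "/files/report.pdf": [],
}

# Hypothetical database records, served only in response to a search query.
# No static page links to these result URLs.
DATABASE = {
    "bergman": "/results?q=bergman",
    "crawler": "/results?q=crawler",
}


def crawl(start):
    """Breadth-first link following, the way a simple crawler sees a site."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in STATIC_LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def search(term):
    """Stand-in for a site's search form: a page exists only for a query."""
    return DATABASE.get(term)


surface = crawl("/index.html")           # what the crawler can index
deep = set(DATABASE.values()) - surface  # what only a query can reach

print("surface:", sorted(surface))
print("deep:", sorted(deep))
```

The crawler never reaches the `/results?q=…` pages because nothing links to them; they only come into existence when someone submits a query.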

Bergman (2001) estimates that the DW contains 7,500 terabytes of data compared to 19 terabytes in the SW. Current link analysis and crawler activity do not help to tap into those sources. Indexing the DW is much more complex and labor intensive, and would probably exceed the storage capabilities currently available to Google. A market exists, as organisations such as the CIA, the FBI or private companies have an interest in using those additional (high quality) resources.
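For a sense of scale, the two Bergman (2001) figures quoted above can be turned into a ratio. This is only back-of-the-envelope arithmetic on those two numbers; the variable names are my own.

```python
# Figures as quoted in the text from Bergman (2001), in terabytes.
deep_web_tb = 7_500
surface_web_tb = 19

# How many times larger the deep web is than the surface web.
ratio = deep_web_tb / surface_web_tb

# Share of all web data (deep + surface) that sits in the deep web.
deep_share = deep_web_tb / (deep_web_tb + surface_web_tb)

print(f"deep/surface ratio: {ratio:.0f}x")             # 395x
print(f"deep web share of total: {deep_share:.1%}")    # 99.7%
```

By these estimates the surface web would be well under one percent of what is out there.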

