Archive for January, 2006

Follow up: Google bombs and the autonomy of search engine vendors

Monday, January 30th, 2006

In my entry on Google bombs on 11/19/2005, I raised the following question:

“How will governments react to such movements of altering the search results in an unfavorable way in the future as knowledge becomes more important? How will search engine providers react? The easiest way to approach this would be to influence or enforce rules on search engine vendors. Hence, we could ask whether search engine providers need to be kept as autonomous as central banks with respect to knowledge?”

Well, as of 1/25/2006 we got an answer to this when reports on Google’s self-censored search engine for China came out. However, as other reports show, censorship also exists in other countries such as Germany or France for certain terms. So there is in fact a need to watch developments in this regard carefully… What do you think or propose?

Related articles:
Harvard Law School, Berkman Center for Internet & Society
NY Times on Google’s China search engine version
Wired on Google and geolocation in searches
NY Times on Google and privacy
New Scientist on China and Google search
Washington Post on Geolocator

What do we really see? - The deep and the surface web

Tuesday, January 10th, 2006

The crawlers of Google, Yahoo, MSN and other search engine providers automatically index the web. All of the web? No, not all of it: just the surface, which consists of billions of documents such as HTML pages or directly linked files of any kind (e.g. mp3, PDF, doc, zip…). If we use complex search strings, we are also able to plunge a little into the grey matter below the surface web (SW): many documents are not directly linked but are still indexed.
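
To make the crawling idea concrete, here is a minimal sketch in Python of what such a crawler does (the seed URL and page limit are made up for illustration, and a real engine would of course tokenize pages rather than store raw HTML): it fetches a page, extracts the hyperlinks, and queues them for fetching in turn. Whatever is never the target of a link simply never enters the index.

    from html.parser import HTMLParser
    from urllib.request import urlopen
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects the href targets of all <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=50):
        """Breadth-first crawl: index a page, then follow its links.
        Anything not reachable via a hyperlink is never seen."""
        queue, seen, index = [seed], set(), {}
        while queue and len(index) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue  # unreachable or broken page: skip it
            index[url] = html  # a real engine would tokenize and rank terms
            parser = LinkExtractor()
            parser.feed(html)
            queue.extend(urljoin(url, link) for link in parser.links)
        return index

    # pages = crawl("http://example.com/")  # hypothetical seed URL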

The deep web (DW), in contrast, is web data that resides in databases and is only dynamically generated in response to queries (e.g. when you run a search on a specific website, log in, or load a page built for your session). It is estimated to be much bigger and to provide more valuable data than the surface web.
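
As a small illustration of why this content stays invisible, consider the sketch below (the endpoint http://example.com/search and the form field q are hypothetical, not any particular site’s API): the result page only comes into existence once a query is submitted, typically via an HTTP POST, so there is no static URL a link-following crawler could ever discover.

    from urllib.request import urlopen, Request
    from urllib.parse import urlencode

    def query_database(search_term):
        """Submit a search form by POST (hypothetical endpoint and field).
        A link-following crawler only issues GET requests for URLs it has
        already seen, so it never triggers this kind of page generation."""
        data = urlencode({"q": search_term}).encode("ascii")
        request = Request("http://example.com/search", data=data)  # assumed form target
        with urlopen(request) as response:
            return response.read().decode("utf-8", errors="replace")

    # results_page = query_database("deep web")  # page exists only for this query

Login walls and session-specific pages fail the same way: the crawler has no credentials to present and no query to send.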

Bergman (2001) estimates that the DW contains 7,500 terabytes of data, compared to 19 terabytes in the SW, i.e. roughly 400 times as much. Current link analysis and crawler techniques do not help to tap into those sources: doing so is far more complex and labor intensive, and the result would probably exceed the storage capacity currently available to Google. A market nevertheless exists, as organisations such as the CIA, the FBI or private companies have an interest in using those additional (high-quality) resources.

Related links:
Search engine trends, marketing
The Deep Web: Surfacing Hidden Value by Bergman (2001, U Michigan)
Search for the invisible web by Sherman (2001, The Guardian)
Index Structure for querying the Deep Web by Qiu/Shao/Zatsman/Shanmugasundaram (2003, Cornell U)
Accessing the Deep Web: A Survey by He/Patel/Zhang/Chen-Chuan Chang (2004, U Illinois Urbana-Champaign)
Deep Web Search engine
Introduction to the deep web by Laura Cohen (2005, SUNY Albany)
Finding unpublished research by Mathews (2004, ACRL)