The advent of the Internet marked the beginning of the information society, giving us access to an unprecedented amount of data and information. But as the Internet grows, more and more information falls out of reach of large-scale, general-purpose web search engines. There are alternatives, such as web directories and specialized search engines targeting a specific domain, as well as envisaged alternatives like the semantic web. Each alternative has its own representation of web data, and each has its own strengths and weaknesses. Combining these representations in a single framework has the potential to provide very accurate and focused search of web data.
The EfFoRT project will develop an approach for combining multiple representations of web information in a common framework based on statistical language models. In this framework, it will be possible, for example, to derive models of the actual language use of web pages to distinguish between arts, business, entertainment, education, etc. Similarly, it will be possible to derive models of the structure of web pages to distinguish between blogs, FAQs, personal web pages, cultural heritage pages, etc. The envisaged techniques have to be robust to all kinds of errors, ranging from errors introduced by imperfect information extraction techniques to imprecise queries formulated by the average web search engine user. An important requirement for any new technique in a web setting is that it has to scale up to terabyte-sized collections. We will develop so-called parsimonious models to derive compact representations and to handle dependencies between representations of the data.
Concretely, the EfFoRT project has the following research aims: