Items where author is affiliated with Simon Fraser University
Number of items: 2.
and He, Xiaofei
and Wang, Can
and Pei, Jian
and Bu, Jiajun
and Chen, Chun
and Guan, Ziyu
and Gang, Lu News Article Extraction with Template-Independent Wrapper.
We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
and Jiang, Daxin
and Pei, Jian
and Chen, Enhong
and Li, Hang Towards Context-Aware Search by Learning a Very Large Variable Length Hidden Markov Model from Search Logs.
Capturing the context of a user’s query from the previous queries and clicks in the same session may help understand the user’s information need. A context-aware approach to document re-ranking, query suggestion, and URL recommendation may improve users’ search experience substantially. In this paper, we propose a general approach to context-aware search. To capture contexts of queries, we learn a variable length Hidden Markov Model (vlHMM) from search sessions extracted from log data. Although the mathematical model is intuitive, how to learn a large vlHMM with millions of states from hundreds of millions of search sessions poses a grand challenge. We develop a strategy for parameter initialization in vlHMM learning which can greatly reduce the number of parameters to be estimated in practice. We also devise a method for distributed vlHMM learning under the map-reduce model. We test our approach on a real data set consisting of 1.8 billion queries, 2.6 billion clicks, and 840 million search sessions, and evaluate the effectiveness of the vlHMM learned from the real data on three search applications: document re-ranking, query suggestion, and URL recommendation. The experimental results show that our approach is both effective and efficient.
About this site
This website has been set up for WWW2009 by Christopher Gutteridge of the University of Southampton, using our EPrints software.
Add your Slides, Posters, Supporting data, whatnots...
If you are presenting a paper or poster and have slides or supporting material you would like to have permentently made public at this website, please email
firstname.lastname@example.org - Include the file(s), a note to say if they are presentations, supporting material or whatnot, and the URL of the paper/poster from this site. eg. http://www2009.eprints.org/128/
It's impractical to add all the workshops at WWW2009 by hand, but if you can provide me with the metadata in a machine readable way, I'll have a go at importing it. If you are good at slinging XML, my ideal import format is visible at http://www2009.eprints.org/import_example.xml
We (Southampton EPrints Project) intend to preserve the files and HTML pages of this site for many years, however we will turn it into flat files for long term preservation. This means that at some point in the months after the conference the search, metadata-export, JSON interface, OAI etc. will be disabled as we "fossilize" the site. Please plan accordingly. Feel free to ask nicely for us to keep the dynamic site online longer if there's a rally good (or cool) use for it...
- WWW2009 EPrints supports OAI 2.0 with a base URL of http://www2009.eprints.org/cgi/oai2
- The JSON URL is http://www2009.eprints.org/cgi/json?callback=function&eprintid=number
To prevent google killing the server by hammering these tools, the /cgi/ URL's are denied to robots.txt - ask Chris if you want an exception made.
Feel free to contact me (Christopher Gutteridge) with any other queries or suggestions. ...Or if you do something cool with the data which we should link to!
These are not directly related to the EPrints set up, but may be of use to delegates.
- Social tool links
- I've put links in the page header to the WWW2009 stuff on flickr, facebook and to a page which will let you watch the #www2009 tag on Twitter. Not really the right place, but not yet made it onto the main conference homepage. Send me any suggestions for new links.