Items where author is affiliated with Yahoo! Inc.
Number of items: 9.
and Raghavan, Hema and Leggetter, Chris. Discovering Users' Specific Geo Intention in Web Search.
Discovering users’ specific and implicit geographic intention in web search can greatly help satisfy users’ information needs. We build a geo intent analysis system that uses minimal supervision to learn a model from large amounts of web-search logs for this discovery. We build a city language model, which is a probabilistic representation of the language surrounding the mention of a city in web queries. We use several features derived from these language models to: (1) identify users’ implicit geo intent and pinpoint the city corresponding to this intent, (2) determine whether the geo intent is localized around the users’ current geographic location, (3) predict cities for queries that mention an entity located in a specific place. Experimental results demonstrate the effectiveness of using features derived from the city language model. We find that (1) the system has over 90% precision and more than 74% accuracy for the task of detecting users’ implicit city-level geo intent; (2) the system achieves more than 96% accuracy in determining whether implicit geo queries are local geo queries, neighbor-region geo queries, or none of these; (3) the city language model can effectively retrieve cities in location-specific queries with high precision (88%) and recall (74%); human evaluation shows that the language model predicts city labels for location-specific queries with high accuracy (84.5%).
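The abstract describes scoring queries against per-city language models. Below is a minimal sketch of that idea, assuming add-one-smoothed unigram models built from toy query-log context words; the data, vocabulary handling, and back-off are illustrative assumptions, not the authors' system.

```python
import math
from collections import Counter

# Hypothetical mini query log: context words observed around each city mention.
# The paper learns this from large-scale web-search logs; this is toy data.
CITY_CONTEXTS = {
    "boston": ["pizza", "delivery", "red", "sox", "apartments", "pizza"],
    "paris": ["hotels", "louvre", "flights", "museums", "hotels"],
}

def build_city_lm(contexts):
    """Build a smoothed unigram language model P(word | city)."""
    models = {}
    vocab = {w for words in contexts.values() for w in words}
    for city, words in contexts.items():
        counts = Counter(words)
        # Add-one smoothing over the shared vocabulary.
        models[city] = {w: (counts[w] + 1) / (len(words) + len(vocab))
                        for w in vocab}
    return models, vocab

def score_query(query_words, models, vocab):
    """Log-likelihood of the (non-city) query words under each city LM."""
    floor = 1 / (len(vocab) + 1)  # crude back-off for out-of-vocabulary words
    return {city: sum(math.log(lm.get(w, floor)) for w in query_words)
            for city, lm in models.items()}

models, vocab = build_city_lm(CITY_CONTEXTS)
# An implicit geo query: no city is mentioned, but the LM features point to one.
print(score_query(["pizza", "delivery"], models, vocab))
```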
and Rajan, Suju and Narayanan, Vijay K. Large Scale Multi-Label Classification via MetaLabeler.
The explosion of online content has made the management of such content non-trivial. Web-related tasks such as web page categorization, news filtering, query categorization, tag recommendation, etc. often involve the construction of multi-label categorization systems on a large scale. Existing multi-label classification methods either do not scale or have unsatisfactory performance. In this work, we propose MetaLabeler to automatically determine the relevant set of labels for each instance without intensive human involvement or expensive cross-validation. Extensive experiments conducted on benchmark data show that the MetaLabeler tends to outperform existing methods. Moreover, MetaLabeler scales to millions of multi-labeled instances and can be deployed easily. This enables us to apply the MetaLabeler to a large scale query categorization problem in Yahoo!, yielding a significant improvement in performance.
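The core MetaLabeler idea, as the abstract describes it, is to learn a meta-model that predicts how many labels an instance should receive, then keep the top-k scores from a base multi-label classifier. A sketch under assumed choices (scikit-learn, one-vs-rest logistic regression, synthetic data); the paper's actual base models and features are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy multi-label data standing in for large-scale query/page categorization.
X, Y = make_multilabel_classification(n_samples=500, n_features=20,
                                      n_classes=5, random_state=0)

# Base model: one-vs-rest scorer over all labels.
base = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Meta-model: predict the *number* of labels for each instance
# (here a simple regressor on the same features).
meta = LinearRegression().fit(X, Y.sum(axis=1))

def predict_metalabeler(x):
    scores = base.decision_function(x.reshape(1, -1)).ravel()
    k = int(round(meta.predict(x.reshape(1, -1))[0]))
    k = max(1, min(k, len(scores)))          # clamp to a sane range
    top = np.argsort(scores)[::-1][:k]       # keep the k highest-scoring labels
    return sorted(top.tolist())

print(predict_metalabeler(X[0]), "true:", np.flatnonzero(Y[0]).tolist())
```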
Gupta, Manish. Predicting Click Through Rate for Job Listings.
Click Through Rate (CTR) is an important metric for ad systems, job portals, and recommendation systems. CTR impacts a publisher’s revenue and an advertiser’s bid amounts in “pay for performance” business models. We learn regression models using features of the job, the job’s optional click history, and features of “related” jobs. We show that our models predict CTR much better than predicting the average CTR for all job listings, even in the absence of click history for the job listing.
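As a rough illustration of the setup, one might regress CTR on listing features and compare against the average-CTR baseline the abstract mentions. Everything below (feature names, synthetic data, the gradient-boosted model) is assumed for illustration, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy feature matrix; column names are illustrative assumptions:
# [job_age_days, title_popularity, related_jobs_avg_ctr, historical_ctr].
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(0, 60, n),   # job_age_days
    rng.random(n),            # title_popularity
    rng.random(n) * 0.1,      # related_jobs_avg_ctr
    rng.random(n) * 0.1,      # historical_ctr (0 for brand-new listings)
])
# Synthetic target: CTR driven mainly by history and related-job CTR.
y = 0.5 * X[:, 3] + 0.4 * X[:, 2] + 0.01 * X[:, 1] + rng.normal(0, 0.005, n)

model = GradientBoostingRegressor().fit(X, y)

# Baseline from the abstract: predict the average CTR for every listing.
baseline = np.full(n, y.mean())
print("model MSE:   ", np.mean((model.predict(X) - y) ** 2))
print("baseline MSE:", np.mean((baseline - y) ** 2))
```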
and Maghoul, Farzin. Query Clustering using Click-Through Graph.
In this paper we describe the problem of discovering query clusters from a click-through graph of web search logs. The graph consists of a set of web search queries, a set of pages selected for the queries, and a set of directed edges connecting a query node to a page node clicked by a user for the query. The proposed method extracts all maximal bipartite cliques (bicliques) from the click-through graph and computes an equivalence set of queries (i.e., a query cluster) from the maximal bicliques. A cluster of queries is formed from the queries in a biclique. We present a scalable algorithm that enumerates all maximal bicliques from the click-through graph. We have conducted experiments on Yahoo web search queries and the results are promising.
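The biclique-to-cluster step can be made concrete on a toy graph. The brute-force enumerator below is exponential in the number of queries and merely stands in for the paper's scalable algorithm, which is not reproduced here.

```python
from itertools import combinations

# Toy click-through graph: query -> set of clicked pages.
clicks = {
    "apple pie recipe": {"p1", "p2"},
    "how to bake apple pie": {"p1", "p2"},
    "apple stock price": {"p3"},
    "aapl quote": {"p3"},
}

def maximal_bicliques(clicks):
    """Enumerate maximal bicliques (query set, page set) by brute force."""
    queries = list(clicks)
    found = set()
    for r in range(1, len(queries) + 1):
        for qs in combinations(queries, r):
            pages = set.intersection(*(clicks[q] for q in qs))
            if not pages:
                continue
            # Close the query side: add every query clicked on all these pages,
            # which guarantees the biclique cannot be extended further.
            full_qs = frozenset(q for q in queries if pages <= clicks[q])
            found.add((full_qs, frozenset(pages)))
    return found

# Each maximal biclique yields one query cluster (its query side).
for qs, ps in maximal_bicliques(clicks):
    print(sorted(qs), "->", sorted(ps))
```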
and Duan, Lei and Zhou, Yiping and Dom, Byron. Threshold Selection for Web-Page Classification with Highly Skewed Class Distribution.
We propose a novel cost-efficient approach to threshold selection for binary web-page classification problems with imbalanced class distributions. In many binary-classification tasks the distribution of classes is highly skewed. In such problems, using uniform random sampling in constructing sample sets for threshold setting requires large sample sizes in order to include a statistically sufficient number of examples of the minority class. On the other hand, manually labeling examples is expensive and budgetary considerations require that the size of sample sets be limited. These conflicting requirements make threshold selection a challenging problem. Our method of sample-set construction is a novel approach based on stratified sampling, in which manually labeled examples are expanded to reflect the true class distribution of the web-page population. Our experimental results show that using false positive rate as the criterion for threshold setting results in lower-variance threshold estimates than using other widely used accuracy measures such as F1 and precision.
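A small sketch of the flavor of threshold setting by a target false-positive rate on a stratified labeled sample; the scores, class mix, and target rate below are synthetic assumptions, not the paper's data.

```python
import numpy as np

# Synthetic classifier scores for a stratified, manually labeled sample.
# The true population is highly skewed (say 1% positive), but stratified
# sampling lets us label a similar number of examples from each class.
rng = np.random.default_rng(0)
neg_scores = rng.normal(0.3, 0.1, 200)   # majority (negative) class
pos_scores = rng.normal(0.7, 0.1, 200)   # minority (positive) class

def threshold_for_fpr(neg_scores, target_fpr=0.01):
    """Smallest score threshold whose false-positive rate <= target.

    FPR is estimated from negatives alone, so the skewed class mix does
    not bias it -- one reason it can be a stable threshold criterion.
    """
    for t in np.sort(neg_scores):
        if np.mean(neg_scores > t) <= target_fpr:
            return t
    return neg_scores.max()

t = threshold_for_fpr(neg_scores)
print("threshold:", round(t, 3), "recall:", np.mean(pos_scores > t))
```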
and Drome, Chris and Kolay, Santanu. Thumbs-Up: A Game for Playing to Rank Search Results.
Human computation is an effective way to channel human effort spent playing games into solving computational problems that are easy for humans but difficult for computers to automate. We propose Thumbs-Up, a new game for human computation with the purpose of playing to rank search results. Our experience with users shows that Thumbs-Up is not only fun to play, but produces more relevant rankings than both a major search engine and optimal rank aggregation using the Kemeny rule.
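The Kemeny rule named in the abstract picks the ranking minimizing total Kendall-tau distance to the input rankings. A brute-force illustration on toy data follows; it is exact but only feasible for a handful of items, since Kemeny aggregation is NP-hard in general.

```python
from itertools import permutations

# Toy player rankings of four search results (best first). Illustrative only.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["a", "c", "b", "d"],
]

def kendall_tau(r1, r2):
    """Number of result pairs the two rankings order differently."""
    pos1 = {x: i for i, x in enumerate(r1)}
    pos2 = {x: i for i, x in enumerate(r2)}
    items = list(pos1)
    return sum((pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
               for i, a in enumerate(items) for b in items[i + 1:])

def kemeny(rankings):
    """Ranking minimizing total Kendall-tau distance to all inputs."""
    return min(permutations(rankings[0]),
               key=lambda p: sum(kendall_tau(list(p), r) for r in rankings))

print(kemeny(rankings))  # -> ('a', 'b', 'c', 'd')
```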
and Vandelle, Gilles. Unsupervised Query Categorization using Automatically-Built Concept Graphs.
Automatic categorization of user queries is an important component of general purpose (Web) search engines, particularly for triggering rich, query-specific content and sponsored links. We propose an unsupervised learning scheme that reduces dramatically the cost of setting up and maintaining such a categorizer, while retaining good categorization power. The model is stored as a graph of concepts where graph edges represent the cross-reference between the concepts. Concepts and relations are extracted from query logs by an offline Web mining process, which uses a search engine as a powerful summarizer for building a concept graph. Empirical evaluation indicates that the system compares favorably on publicly available data sets (such as KDD Cup 2005) as well as on portions of the current query stream of Yahoo! Search, where it is already changing the experience of millions of Web search users.
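One can picture the concept graph as a term-labeled graph in which direct term matches are reinforced along cross-reference edges. The graph, weights, and scoring below are invented for illustration; the paper builds its graph from query logs and search-engine summaries.

```python
# Hypothetical concept graph: concept -> (matching terms, weighted neighbors).
CONCEPT_GRAPH = {
    "travel":  {"terms": {"flights", "hotels", "tickets"},
                "neighbors": {"finance": 0.1}},
    "finance": {"terms": {"stock", "price", "quote"},
                "neighbors": {"travel": 0.1}},
    "sports":  {"terms": {"scores", "league", "match"},
                "neighbors": {}},
}

def categorize(query, graph, spread=0.3):
    words = set(query.lower().split())
    # Direct evidence: term overlap between the query and each concept.
    direct = {c: len(words & g["terms"]) for c, g in graph.items()}
    # One step of propagation along weighted cross-reference edges.
    final = {c: direct[c] + spread * sum(w * direct[n]
                                         for n, w in g["neighbors"].items())
             for c, g in graph.items()}
    return max(final, key=final.get)

print(categorize("cheap flights and hotels in rome", CONCEPT_GRAPH))  # travel
```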
and Huynh, Xinh. User-Centric Content Freshness Metrics for Search Engines.
In order to return relevant search results, a search engine must keep its local repository synchronized to the Web, but it is usually impossible to attain perfect freshness. Hence, it is vital for a production search engine continually to monitor and improve repository freshness. Most previous freshness metrics, formulated in the context of developing better synchronization policies, focused on the web crawler while ignoring other parts of a search engine. But, the freshness of documents in a web crawler does not necessarily translate directly into the freshness of search results as seen by users. We propose metrics for measuring freshness from a user’s perspective, which take into account the latency between when documents are crawled and when they are viewed by users, as well as the variation in user click and view frequency among different documents. We also describe a practical implementation of these metrics that were used in a production search engine.
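A sketch of the metric's flavor: weight freshness by actual user views and measure crawl-to-view latency per view, rather than measuring the repository alone. The log format and numbers below are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Toy view log: (crawl_time, view_time, document_changed_since_crawl).
now = datetime(2009, 4, 20)
views = [
    (now - timedelta(days=10), now - timedelta(days=1), True),
    (now - timedelta(hours=2), now - timedelta(hours=1), False),
    (now - timedelta(days=3),  now,                      False),
]

def user_centric_freshness(views):
    """Fraction of user views that saw an up-to-date document."""
    return sum(1 for _, _, stale in views if not stale) / len(views)

def mean_view_latency(views):
    """Average crawl-to-view latency, weighted per view."""
    total = sum((viewed - crawled for crawled, viewed, _ in views),
                timedelta())
    return total / len(views)

print("freshness:", user_centric_freshness(views))
print("mean crawl-to-view latency:", mean_view_latency(views))
```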
and Dasdan, Ali. The Value of Socially Tagged URLs for a Search Engine.
Social bookmarking has emerged as a growing source of human generated content on the web. In essence, bookmarking involves URLs and tags on them. In this paper, we perform a large scale study of the usefulness of bookmarked URLs from the top social bookmarking site Delicious. Instead of focusing on the dimension of tags, which has been covered in previous work, we explore social bookmarking from the dimension of URLs. More specifically, we investigate the Delicious URLs and their content to quantify their value to a search engine. For their value in leading to good content, we show that the Delicious URLs have higher quality content and more external outlinks. For their value in satisfying users, we show that more of the Delicious URLs are clicked and that they receive more clicks. We suggest that, based on their value, the Delicious URLs should be used as another source of seed URLs for crawlers.
About this site
This website has been set up for WWW2009 by Christopher Gutteridge of the University of Southampton, using our EPrints software.
Add your Slides, Posters, Supporting data, whatnots...
If you are presenting a paper or poster and have slides or supporting material you would like to have permanently made public at this website, please email
firstname.lastname@example.org - Include the file(s), a note to say if they are presentations, supporting material or whatnot, and the URL of the paper/poster from this site, e.g. http://www2009.eprints.org/128/
It's impractical to add all the workshops at WWW2009 by hand, but if you can provide me with the metadata in a machine readable way, I'll have a go at importing it. If you are good at slinging XML, my ideal import format is visible at http://www2009.eprints.org/import_example.xml
We (Southampton EPrints Project) intend to preserve the files and HTML pages of this site for many years; however, we will turn it into flat files for long term preservation. This means that at some point in the months after the conference the search, metadata-export, JSON interface, OAI etc. will be disabled as we "fossilize" the site. Please plan accordingly. Feel free to ask nicely for us to keep the dynamic site online longer if there's a really good (or cool) use for it...
- WWW2009 EPrints supports OAI 2.0 with a base URL of http://www2009.eprints.org/cgi/oai2
- The JSON URL is http://www2009.eprints.org/cgi/json?callback=function&eprintid=number
To prevent Google from killing the server by hammering these tools, the /cgi/ URLs are disallowed in robots.txt - ask Chris if you want an exception made.
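As an illustration, here is one way the JSON interface above could be queried from Python, unwrapping the JSONP callback before parsing. The record structure is not documented here, and per the fossilization note above the dynamic interface may already be offline, so treat this as a sketch of intended usage.

```python
import json
import re
from urllib.request import urlopen

# Fetch one record from the JSON interface. The endpoint wraps its output in
# the JSONP callback you pass, so strip the "cb(...)" wrapper before parsing.
url = "http://www2009.eprints.org/cgi/json?callback=cb&eprintid=128"
raw = urlopen(url).read().decode("utf-8")
payload = re.sub(r"^\s*cb\((.*)\)\s*;?\s*$", r"\1", raw, flags=re.DOTALL)
record = json.loads(payload)
print(record)
```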
Feel free to contact me (Christopher Gutteridge) with any other queries or suggestions. ...Or if you do something cool with the data which we should link to!
These are not directly related to the EPrints setup, but may be of use to delegates.
- Social tool links
- I've put links in the page header to the WWW2009 stuff on flickr, facebook and to a page which will let you watch the #www2009 tag on Twitter. This isn't really the right place for them, but they haven't yet made it onto the main conference homepage. Send me any suggestions for new links.