www2009 Madrid conference notes

I kept notes on various presentations during the www2009 conference. Unfortunately, they are rather sparse, but I wanted to do something with them before I completely forgot the context. So here are the rough notes. Think of it as liveblogging, but not really live. :)

Socializing Big Data – Jeff Hammerbacher 4/22/09 12:20 PM

Hadoop is open to everyone and will play a central part in future development

Hadoop was the name of the developer’s son’s stuffed elephant toy

Written mostly in Java, inspired by Google infrastructure

  • plain architecture, no major hardware.
  • Resources: reddit, hacker news
  • Data sharing sites
    • Data.gov, many eyes, swivel, theinfo.org, infochimps, icharts
  • Hadoop is for offline processing
  • Amazon provides Hadoop connectivity

WebNC: efficient sharing of web apps 4/22/09 12:20 PM

WebNC – presented by Andreas Girgensohn of FX Palo Alto Laboratory

  • Share a web browser window during conference calls, like Adobe Connect
  • Requires Firefox for the presenter
  • Uses only HTML and JavaScript, so it works on the iPhone; no Flash needed

How it works

  1. The presenter's computer has a Firefox extension that grabs the screen 3x a second and breaks it into a grid of images.
  2. This is passed to the server, which stores the grid and any changes.
  3. The viewer's extension rebuilds the grid to display the current view and any changes.
  4. There is some C++ logic to display native controls, such as form elements.
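
The grid-and-changes idea in the steps above can be sketched roughly as follows. This is only an illustration, not WebNC's actual code: the frame is modeled as a 2D list of pixel values, and the names `split_into_tiles`, `changed_tiles`, and `apply_changes` are made up for this example.

```python
from typing import Dict, List, Tuple

Tile = Tuple[int, int]      # (row, col) position of a tile in the grid
Frame = List[List[int]]     # a screenshot modeled as a 2D list of pixel values


def split_into_tiles(frame: Frame, tile_size: int) -> Dict[Tile, tuple]:
    """Break a frame into a grid of tile_size x tile_size blocks."""
    tiles = {}
    for top in range(0, len(frame), tile_size):
        for left in range(0, len(frame[0]), tile_size):
            block = tuple(
                tuple(row[left:left + tile_size])
                for row in frame[top:top + tile_size]
            )
            tiles[(top // tile_size, left // tile_size)] = block
    return tiles


def changed_tiles(previous: Dict[Tile, tuple], current: Dict[Tile, tuple]) -> Dict[Tile, tuple]:
    """Keep only the tiles that differ from the previous grab."""
    return {pos: block for pos, block in current.items() if previous.get(pos) != block}


def apply_changes(view: Dict[Tile, tuple], changes: Dict[Tile, tuple]) -> Dict[Tile, tuple]:
    """Viewer side: overlay the changed tiles on the last known grid."""
    updated = dict(view)
    updated.update(changes)
    return updated
```

With a 4x4 frame and 2x2 tiles, changing a single pixel in the bottom-right corner means only the tile at grid position (1, 1) has to be sent to viewers.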

The Slashdot Zoo: Mining a Social Network with Negative Edges 4/22/09 12:20 PM

Jérôme Kunegis, Andreas Lommatzsch and Christian Bauckhage

  • social network with negative edges
  • multiplication rule applied to such networks
  • analysis at global, node, edge level
  • Slashdot allows you to assign a person as your friend, freak, or neutral
  • the enemy of my enemy is my friend
  • use this logic to predict trolls and popularity of freaks and friends.
  • Used PageRank and a modified PageRank to predict the strength of a user
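
A tiny sketch of the multiplication rule from the bullets above, assuming positive edges are +1 and negative edges are -1, so that the sign of a two-step path is the product of its edge signs ("the enemy of my enemy is my friend"). The function name `predict_sign` and the sample users are made up for illustration; this is not the authors' algorithm, which worked with PageRank variants.

```python
from typing import Dict, Optional, Tuple

POSITIVE, NEGATIVE = 1, -1
Edges = Dict[Tuple[str, str], int]   # (from_user, to_user) -> +1 or -1


def predict_sign(edges: Edges, source: str, target: str) -> Optional[int]:
    """Guess the sign of source -> target from all two-step paths.

    Each path contributes the product of its two edge signs; the sum
    decides the prediction. Returns None when there is no evidence.
    """
    total = 0
    for (start, middle), first_sign in edges.items():
        if start != source:
            continue
        second_sign = edges.get((middle, target))
        if second_sign is not None:
            total += first_sign * second_sign
    if total == 0:
        return None
    return POSITIVE if total > 0 else NEGATIVE


edges = {
    ("alice", "bob"): NEGATIVE,
    ("bob", "carol"): NEGATIVE,   # enemy of my enemy
    ("alice", "dave"): POSITIVE,
    ("dave", "erin"): NEGATIVE,   # enemy of my friend
}
```

Here `predict_sign(edges, "alice", "carol")` comes out positive (two negative edges multiply to +1), while `predict_sign(edges, "alice", "erin")` comes out negative.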

Combining multi-level audio descriptors 4/22/09 17:20 PM

Jun Wang, Xavier Amatriain and David Garcia Garzon. Chinese Academy of Sciences, Telefonica Research, David Garcia Garzon

  • The CLAM framework
  • started in October 2000
  • framework for audio and music.
  • Object oriented and has been documented through a Pattern Language and DSL
  • Can be used for rapid prototyping and real time applications

Yahoo! Pipes-like interface for analyzing and modifying music.

  • Audio visualizations display chords and their frequency
  • A semantic web crawler searches the web for information about a song
    • The extractor uses audio fingerprinting and metadata to identify songs on MusicBrainz (MBID)
    • It then outputs RDF statements linking local files with remote web identifiers
    • It extracts high-level descriptors such as editorial metadata, user comments, reviews, tags…
  • An XML document is generated with all of the information produced by the service
  • Potential collaboration on annotating videos/television in real time

Microdonations for Stopping Spam 4/22/09 12:20 PM

Sharad Goel, Jake Hofman, John Langford, David Pennock and Daniel Reeves, Yahoo! Inc.

  • Domain filtering has been the most straightforward approach to combating spam. Reputable sources can also be whitelisted.
    • A secure hash also ensures reputable messages are not altered along the way
  • Content filtering looks at patterns and text within the message that signal spam.
    • Spammers are constantly trying to get around the filters.
  • Economic approaches: charge the user in some way for sending a message, via memory cycles, CAPTCHAs, or monetary payments
    • What is the amount to pay? It has to be expensive for spammers but cheap enough for users
    • Computational costs are not a problem for spammers using zombie botnets
    • Human cycles, via CAPTCHAs, are annoying and would cause people to go elsewhere
    • Early adopters are penalized, as the reduction in spam will not be visible until a majority of people use the system.
  • CentMail: users get stamps for sending messages.
    • Stamps promoted via signatures to let people know about the donations
  • Even limited adoption would help reduce the proliferation of spam. Spam filters will treat these stamps as a whitelist signal, so stamped messages will not end up in the junk folder.
  • The Certify service issues the certification that a message is valid
  • The Verify service recognizes the certificate.
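
The talk notes don't record how the Certify and Verify services actually work, so the following is only a guess at the general shape: a stamp computed as a keyed hash (HMAC) over the sender and message, using a secret held by the service. `SERVICE_KEY`, `certify`, and `verify` are hypothetical names, not CentMail's API.

```python
import hashlib
import hmac

# Hypothetical secret known only to the stamping service (an assumption).
SERVICE_KEY = b"hypothetical-service-secret"


def certify(sender: str, message: str) -> str:
    """Issue a stamp: a keyed hash over the sender and the message body."""
    payload = f"{sender}\n{message}".encode("utf-8")
    return hmac.new(SERVICE_KEY, payload, hashlib.sha256).hexdigest()


def verify(sender: str, message: str, stamp: str) -> bool:
    """Check that a stamp matches this exact sender and message."""
    expected = certify(sender, message)
    return hmac.compare_digest(expected, stamp)
```

A spam filter could call `verify` before its usual checks and whitelist anything that passes. A stamp copied onto a different message fails, because the hash covers the message text.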

Ranking for Search 4/22/09 2:20 PM

Editorial Judgements as truth

Pros

  • Control full process
  • Can calibrate judges to a consensus

Cons

  • Ownership of query

Position Bias

  • Position bias throws off clicks as a source of truth: what does no click mean?
  • An item that isn’t displayed cannot be clicked

Social Labeling

  • Social labeling game. Give people a boring task and make it fun to collect data.
  • Playing a game removes the no click problem. The users can also mark an item as bad.
  • Non-relevant content is not going to harm the brand as it is part of the game.
  • Preference models

Frequency

  • Pairwise probability model
  • Models pairwise interactions but not comparison set
  • Go Model
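
The notes don't capture the details of the pairwise probability model, so here is just one simple way to read it: estimate the probability that one result is preferred over another from how often it won that head-to-head comparison in the game, with add-one smoothing. The function name `preference_probability` and the sample data are made up; this is an assumed stand-in, not the model from the talk.

```python
from collections import Counter
from typing import Iterable, Tuple


def preference_probability(
    comparisons: Iterable[Tuple[str, str]], a: str, b: str
) -> float:
    """Estimate P(a is preferred over b) from (winner, loser) pairs.

    Uses add-one smoothing so an unseen pair gets 0.5 rather than
    a division by zero.
    """
    wins = Counter(comparisons)
    a_over_b = wins[(a, b)]
    b_over_a = wins[(b, a)]
    return (a_over_b + 1) / (a_over_b + b_over_a + 2)


# Each tuple is (winner, loser) from one round of the labeling game.
game_results = [("doc1", "doc2"), ("doc1", "doc2"), ("doc2", "doc1")]
```

Note that this looks at each pair in isolation, which matches the remark that the model captures pairwise interactions but not the full comparison set.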

Compare vs. Editorial Judgments

  • How does the pairwise model compare against editorial judgments?
  • Use set of already collected editorial judgments for 60 of 427 queries.
  • The game data can be better than the judged values.

Relative judgments can be made more consistently than absolute judgments


Rated Aspect Summarization of Short Comments 4/22/09 12:20 PM

Yue Lu, ChengXiang Zhai and Neel Sundaresan

Opinions across the web typically have an overall rating

On eBay there is a feedback number, short phrases, and a rating of positive, negative, or neutral.

  • eBay presents the percentage of feedback as positive or negative
  • Distributions can also be seen. Ratings over a time span are important.
  • How can you tell why they are good, e.g. shipping, quality, service…

eBay wants to break this rating down into its various components. This would be generated from the large number of comments

Challenges

  • How to identify coherent aspects?
  • How to accurately rate each aspect?
  • How to get meaningful phrases supporting the ratings?

eBay overall approach

  • Aspect and discovery (fast + shipping) = positive. Head term clustering.
  • Combine people that say fast + shipping
  • Unstructured PLSA: assign high probability to modifiers i.e. fast and shipping
  • Structured PLSA: create configuration file for each head term and the strengths associated when combined with a modifier
    • A modifier can be associated with multiple head terms, i.e. fast (shipping, delivery, communications)
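
To make the head term / modifier idea concrete, here is a rough sketch that groups two-word comments such as "fast shipping" by head term and averages their ratings. This is plain counting, not PLSA, and the function name `summarize_aspects`, the two-word assumption, and the +1/0/-1 rating scale are simplifications for illustration only.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def summarize_aspects(
    comments: Iterable[Tuple[str, int]]
) -> Dict[str, Tuple[float, str]]:
    """Group "modifier head" phrases by head term.

    comments: (phrase, rating) pairs, rating in {+1, 0, -1}.
    Returns head term -> (average rating, most common modifier).
    """
    ratings = defaultdict(list)
    modifiers = defaultdict(Counter)
    for phrase, rating in comments:
        words = phrase.lower().split()
        if len(words) != 2:
            continue  # this sketch only handles "modifier head" phrases
        modifier, head = words
        ratings[head].append(rating)
        modifiers[head][modifier] += 1
    return {
        head: (sum(values) / len(values), modifiers[head].most_common(1)[0][0])
        for head, values in ratings.items()
    }


feedback = [
    ("fast shipping", 1),
    ("Fast shipping", 1),
    ("slow shipping", -1),
    ("great communication", 1),
]
```

For the sample feedback, the "shipping" aspect ends up with an average rating of about 0.33 and "fast" as its representative modifier, which is the kind of per-aspect breakdown the talk was after.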

Mining the Web 2.0 for Better Search 4/22/09 12:20 PM

Dr. Ricardo Baeza-Yates, Yahoo! Research, Barcelona, Spain

Amount of content on the web (2007)

  • 8-10 GB user generated content/day
  • 3-4 GB published content
  • private text content 3 TB

Examples

Explicit

  • metadata
  • rdf
  • Wikipedia = UGC
  • Open Data Project = UGC
  • Yahoo Answers = UGC
  • Flickr = UGC

Implicit

  • text = UGC
  • anchors and links
  • queries and clicks – private

Crowdsourcing – wisdom of crowds

  • importance of diversity, independence and decentralization – aggregating data
  • popularity
  • diversity – long tail
  • quality
  • coverage

Searchpad

  • Detects a trend in search queries and produces an online page for later reference
