I kept notes on various presentations during the WWW2009 conference. Unfortunately, they are rather sparse, but I wanted to do something with them before I completely forgot the context. So here are the rough notes. Think of it as liveblogging but not really live.
Socializing Big Data – Jeff Hammerbacher 4/22/09 12:20 PM
Hadoop is open to everyone and will play a central part in future development
Hadoop was the name of the developer’s son’s stuffed elephant toy
Written mostly in Java, inspired by Google infrastructure
- plain architecture, no major hardware.
- Resources: reddit, hacker news
- Data sharing sites
- Data.gov, Many Eyes, Swivel, theinfo.org, Infochimps, iCharts
- Hadoop is for offline processing
- Amazon provides Hadoop connectivity
WebNC: efficient sharing of web apps 4/22/09 12:20 PM
WebNC – presented by Andreas Girgensohn of FX Palo Alto Laboratory
- share a web browser window during conference calls, like Adobe Connect
- Requires Firefox for the presenter
- HTML, JavaScript works on iPhone, non-flash
How it works
- presenter computer has a Firefox extension that grabs the screen 3x a second and breaks it into a grid of images
- This is passed to the server that stores the grid and any changes.
- The viewer's extension rebuilds the grid to display the current view and any changes.
- There is some C++ logic to display native controls, such as form elements.
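The grid idea above can be sketched in a few lines. This is only an illustration of splitting a capture into tiles and sending the tiles that changed, not WebNC's actual code; the function names, the tile size, and the use of nested lists of integers in place of real screenshots are all assumptions.

```python
from typing import Dict, List, Tuple

Frame = List[List[int]]                  # a capture as rows of pixel values
Tile = Tuple[Tuple[int, ...], ...]       # an immutable block of pixels
Grid = Dict[Tuple[int, int], Tile]       # tiles keyed by (tile_row, tile_col)


def split_into_tiles(frame: Frame, tile_size: int) -> Grid:
    """Presenter side: cut a capture into square tiles."""
    tiles: Grid = {}
    for top in range(0, len(frame), tile_size):
        for left in range(0, len(frame[0]), tile_size):
            tile = tuple(
                tuple(row[left:left + tile_size])
                for row in frame[top:top + tile_size]
            )
            tiles[(top // tile_size, left // tile_size)] = tile
    return tiles


def changed_tiles(previous: Grid, current: Grid) -> Grid:
    """Server side: keep only the tiles that differ from the last capture."""
    return {pos: tile for pos, tile in current.items() if previous.get(pos) != tile}


def apply_update(view: Grid, update: Grid) -> Grid:
    """Viewer side: patch the displayed grid with the changed tiles."""
    patched = dict(view)
    patched.update(update)
    return patched


# Two 4x4 captures that differ in a single pixel in the bottom-right corner.
first = [[0] * 4 for _ in range(4)]
second = [row[:] for row in first]
second[3][3] = 9

grid_a = split_into_tiles(first, 2)
grid_b = split_into_tiles(second, 2)
update = changed_tiles(grid_a, grid_b)
viewer = apply_update(grid_a, update)
```

Only one of the four tiles ends up in `update`, which is the point of the grid: a small change on screen means a small amount of data for each viewer.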
The Slashdot Zoo: Mining a Social Network with Negative Edges 4/22/09 12:20 PM
Jérôme Kunegis, Andreas Lommatzsch and Christian Bauckhage
- social network with negative edges
- multiplication rule applied to such networks
- analysis at global, node, edge level
- Slashdot allows you to mark a person as your friend or foe, or leave them neutral; people who mark you as a foe are your freaks
- the enemy of my enemy is my friend
- use this logic to predict trolls and popularity of freaks and friends.
- Used PageRank and a modified PageRank to predict the strength of a user
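Here is a rough sketch of the multiplication rule as I understood it: the sign of a two-step path is the product of the signs of its edges, so "the enemy of my enemy" comes out positive. The toy users, the edge dictionary, and the `predict_sign` helper are made up for illustration and are not taken from the paper.

```python
from typing import Dict, Tuple

# +1 marks a friend edge, -1 marks a negative edge (hypothetical toy data).
edges: Dict[Tuple[str, str], int] = {
    ("alice", "bob"): -1,
    ("bob", "carol"): -1,
    ("alice", "dave"): 1,
    ("dave", "erin"): -1,
}


def predict_sign(edges: Dict[Tuple[str, str], int], source: str, target: str) -> int:
    """Sum the sign products over all two-step paths source -> w -> target.

    A positive total suggests a friend, a negative total suggests an enemy,
    and zero means there is no two-step evidence either way.
    """
    score = 0
    for (u, w), first_sign in edges.items():
        if u != source:
            continue
        second_sign = edges.get((w, target))
        if second_sign is not None:
            score += first_sign * second_sign
    return score


enemy_of_enemy = predict_sign(edges, "alice", "carol")   # (-1) * (-1)
enemy_of_friend = predict_sign(edges, "alice", "erin")   # (+1) * (-1)
no_evidence = predict_sign(edges, "alice", "bob")        # no two-step path
```

This is the same as reading one entry of the squared signed adjacency matrix, just written out with a dictionary.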
Jun Wang, Xavier Amatriain and David Garcia Garzon. Chinese Academy of Sciences, Telefonica Research, David Garcia Garzon
- The CLAM framework
- started in October 2000
- framework for audio and music.
- Object oriented and has been documented through a Pattern Language and DSL
- Can be used for rapid prototyping and real time applications
Yahoo! Pipes-like interface to analyze and modify music.
- Audio visualizations display chords and their frequency
- Semantic web crawler is searching the web for information about a song
- The extractor uses audio fingerprinting and metadata to identify songs on MusicBrainz (MBID)
- It then outputs rdf statements linking local files with remote web identifiers
- It extracts high level descriptors such as editorial metadata, user comments, reviews, tags…
- An XML file is generated with all of the information produced by the service
- Potential collaboration on annotating videos/television in real time
Microdonations for Stopping Spam 4/22/09 12:20 PM
Sharad Goel, Jake Hofman, John Langford, David Pennock and Daniel Reeves, Yahoo! Inc.
- Domain filtering has been the most straightforward approach to combating spam. Reputable sources can also be whitelisted.
- A secure hash also ensures reputable messages are not altered along the way
- Content filtering looks at patterns and text within the message that signal spam.
- Spammers are constantly trying to get around the filters.
- Economic approaches: charge the user in some way for sending a message, via memory cycles, a CAPTCHA, or monetary payment
- What is the amount to pay? It has to be expensive for spammers but cheap enough for users
- Computational costs are not a problem with spammers using zombie botnets
- Human cycles, via CAPTCHAs, are annoying and would cause people to go elsewhere
- Early adopters are penalized, as the reduction in spam will not be visible until a majority of people use it.
- CentMail: users get stamps for sending messages.
- Stamps promoted via signatures to let people know about the donations
- Even limited adoption would help reduce the proliferation of spam. Spam filters will treat stamped messages as whitelisted, so they will not end up in junk folders.
- Certify service defines the certification for the message that it is valid
- Verify service recognizes the certificate.
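To make the certify/verify split concrete, here is a minimal sketch. The talk notes do not record how CentMail stamps are actually built, so everything here is an assumption: the HMAC-SHA256 construction, the `SECRET_KEY`, and the `certify` and `verify` function names are mine, chosen only to show a stamp that breaks if the message is altered.

```python
import hashlib
import hmac

# Hypothetical key known only to the stamp service.
SECRET_KEY = b"demo-secret"


def certify(sender: str, message: str) -> str:
    """Certify service: issue a stamp bound to this sender and message."""
    payload = f"{sender}\n{message}".encode("utf-8")
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def verify(sender: str, message: str, stamp: str) -> bool:
    """Verify service: recompute the stamp and compare in constant time."""
    expected = certify(sender, message)
    return hmac.compare_digest(expected, stamp)


stamp = certify("alice@example.com", "Lunch on Friday?")
accepted = verify("alice@example.com", "Lunch on Friday?", stamp)
tampered = verify("alice@example.com", "Buy cheap pills!", stamp)
```

A mail filter that trusts the verify service could whitelist any message for which `verify` returns `True`, which is where the limited-adoption benefit would come from.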
Ranking for Search 4/22/09 2:20 PM
Editorial Judgements as truth
Pros
- Control full process
- Can calibrate judges to a consensus
Cons
Position Bias
- Position bias throws off clicks as a source of truth. What does no click mean?
- An item that isn’t displayed cannot be clicked
Social Labeling
- Social labeling game. Give people a boring task and make it fun to collect data.
- Playing a game removes the no click problem. The users can also mark an item as bad.
- Non-relevant content is not going to harm the brand as it is part of the game.
- Preference models
Frequency
- Pairwise probability model
- Models pairwise interactions but not comparison set
- Go Model
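For the pairwise probability model, a simple way to picture it is to count how often one result was preferred over another in the game. The sketch below is my own simplification, not the model from the talk: the add-one smoothing, the `preference_probability` name, and the document labels are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def preference_probability(
    outcomes: List[Tuple[str, str]], a: str, b: str, smoothing: float = 1.0
) -> float:
    """Estimate P(a is preferred over b) from (winner, loser) pairs.

    Smoothing keeps unseen pairs at 0.5 instead of dividing by zero.
    """
    wins = Counter(outcomes)
    a_wins = wins[(a, b)]
    b_wins = wins[(b, a)]
    return (a_wins + smoothing) / (a_wins + b_wins + 2 * smoothing)


# doc1 was preferred over doc2 three times, doc2 over doc1 once.
outcomes = [("doc1", "doc2")] * 3 + [("doc2", "doc1")]

p_doc1 = preference_probability(outcomes, "doc1", "doc2")
p_doc2 = preference_probability(outcomes, "doc2", "doc1")
p_unseen = preference_probability(outcomes, "doc1", "doc3")
```

As the notes say, this only looks at one pair at a time and ignores which other results were shown alongside them.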
Compare vs. Editorial Judgments
- How does the pairwise model compare against editorial judgments?
- Use set of already collected editorial judgments for 60 of 427 queries.
- The game data can be better than the judged values.
Relative judgments can be made more consistently than absolute judgments
Rated Aspect Summarization of Short Comments 4/22/09 12:20 PM
Yue Lu, ChengXiang Zhai and Neel Sundaresan
Opinions across the web typically have an overall rating
On eBay there is a feedback number, short phrases, and a rating of positive, negative, or neutral.
- eBay presents the percentage of feedback that is positive or negative
- Distributions can also be seen. Ratings over a time span are important.
- How can you tell why they are good, e.g. shipping, quality, service…
eBay wants to break this rating down into its various components. This would be generated from the large number of comments
Challenges
- How to identify coherent aspects?
- How to accurately rate each aspect
- How to get meaningful phrases supporting the ratings?
eBay overall approach
- Aspect and discovery (fast + shipping) = positive. Head term clustering.
- Combine people that say fast + shipping
- Unstructured PLSA: assign high probability to modifiers, e.g. fast and shipping
- Structured PLSA: create configuration file for each head term and the strengths associated when combined with a modifier
- A modifier can be associated with multiple head terms, e.g. fast (shipping, delivery, communications)
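The head term idea can be shown with a much simpler stand-in than PLSA: group the (modifier, head term) phrases by head term and average the ratings of the comments they came from. This is not the method from the talk; the `summarize_aspects` helper, the sample phrases, and the +1/-1 rating scale are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def summarize_aspects(phrases: List[Tuple[str, str, int]]) -> Dict[str, dict]:
    """Group (modifier, head_term, rating) phrases by head term.

    Returns, per head term, the average rating and the most common modifier.
    """
    ratings = defaultdict(list)
    modifiers = defaultdict(Counter)
    for modifier, head, rating in phrases:
        ratings[head].append(rating)
        modifiers[head][modifier] += 1

    summary = {}
    for head, values in ratings.items():
        summary[head] = {
            "average_rating": sum(values) / len(values),
            "top_modifier": modifiers[head].most_common(1)[0][0],
        }
    return summary


# Hypothetical phrases pulled from short feedback comments (+1 good, -1 bad).
phrases = [
    ("fast", "shipping", 1),
    ("fast", "shipping", 1),
    ("slow", "shipping", -1),
    ("great", "communication", 1),
]

summary = summarize_aspects(phrases)
```

Here `summary["shipping"]` gives a per-aspect rating plus a supporting phrase ("fast shipping"), which is roughly the kind of output the talk was aiming for.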
Mining the Web 2.0 for Better Search 4/22/09 12:20 PM
Dr. Ricardo Baeza-Yates, Yahoo! Research, Barcelona, Spain
amount of content on the web (2007)
- 8-10 GB user generated content/day
- 3-4 GB published content
- private text content 3 TB
Examples
Explicit
- metadata
- rdf
- Wikipedia = UGC
- Open Data Project = UGC
- Yahoo Answers = UGC
- Flickr = UGC
Implicit
- text = UGC
- anchors and links
- queries and clicks – private
Crowdsourcing – wisdom of crowds
- importance of diversity, independence and decentralization – aggregating data
- popularity
- diversity – long tail
- quality
- coverage
- Detects a trend in search queries and produces an online page for later reference