Archive for the 'BOSS Functionality' Category

Searchmonkey data is now much easier in BOSS

BOSS’s integration with SearchMonkey provides great structured data. However, it has been a bit tricky to get results filled with the appropriate data. This blog post from the search team introduces the new SearchMonkey query filters: Accessing SearchMonkey Structured Objects via BOSS.

The SearchMonkey team has been encouraging developers to use our structured data to build semantic Web applications ever since we partnered with BOSS. Using the BOSS API, you can access SearchMonkey structured objects.

To restrict the result set to pages with SearchMonkey objects, just add “searchmonkey:<objecttype>” to your query. The result set from BOSS will only contain URLs that have objects of that type.
Accessing SearchMonkey Structured Objects via BOSS

Here’s a list of SearchMonkey filters:

  • searchmonkey:video – restricts the result set to videos.
  • searchmonkey:product – restricts the result set to products.
  • searchmonkey:local – restricts the result set to local businesses.
  • searchmonkey:event – restricts the result set to events.
  • searchmonkey:document – restricts the result set to presentations, spreadsheets, and similar document formats.
  • searchmonkey:discussion – restricts the result set to blogs and forums.
  • searchmonkey:game – restricts the result set to Flash games.

Here’s a sample search request for iPhone products:
http://boss.yahooapis.com/ysearch/web/v1/iphone+searchmonkey:product?appid=insert-your-appid&format=xml&start=0&count=15&view=keyterms,searchmonkey_rdf
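
As a rough sketch, here is how that request URL could be assembled in Python. The function name and its defaults are my own, purely for illustration; only the endpoint and the query parameters come from the sample request above.

```python
from urllib.parse import quote_plus, urlencode

# Endpoint taken from the sample request above.
BOSS_WEB_ENDPOINT = "http://boss.yahooapis.com/ysearch/web/v1/"


def build_searchmonkey_url(terms, object_type, appid, start=0, count=15):
    """Build a BOSS web search URL restricted to one SearchMonkey object type.

    Hypothetical helper: the name and signature are not part of BOSS.
    """
    # Append the filter to the query, e.g. "iphone searchmonkey:product".
    query = f"{terms} searchmonkey:{object_type}"
    # Spaces become "+"; the colon in the filter is left unescaped.
    path = quote_plus(query, safe=":")
    params = urlencode(
        {
            "appid": appid,
            "format": "xml",
            "start": start,
            "count": count,
            "view": "keyterms,searchmonkey_rdf",
        },
        safe=",",  # keep the comma-separated view list readable
    )
    return f"{BOSS_WEB_ENDPOINT}{path}?{params}"


print(build_searchmonkey_url("iphone", "product", "insert-your-appid"))
```

Swapping "product" for any other filter in the list, such as "video" or "event", gives the matching restricted query.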

Watch that Request Length

Yahoo! BOSS’s sites param gives us great flexibility in creating vertical search engines. However, the length of the request is limited to a fixed number of characters. Here are some tips for keeping the length as short as possible.

  1. Forget passing subdirectories: foo.com/bar is treated the same as foo.com. BOSS does differentiate between subdomains, so bar.foo.com is not the same as foo.com. In a real-world example, passing finance.yahoo.com/news will be interpreted as finance.yahoo.com, but finance.yahoo.com will give a different result than sports.yahoo.com.
  2. Remove www from the URL; it just wastes space. There may be an exception when a site was not set up to work without the www subdomain, but I doubt this would make an impact.
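
Both tips are easy to automate. The snippet below is a minimal sketch and not part of BOSS: the helper names are made up, and I am assuming the sites value is a comma-separated list of hostnames.

```python
from urllib.parse import urlsplit


def shorten_site(site):
    """Reduce a site entry to its bare hostname (hypothetical helper)."""
    # urlsplit only recognises the host when a scheme or "//" is present.
    if "//" not in site:
        site = "//" + site
    host = urlsplit(site).hostname or ""
    # Tip 2: drop a leading "www." to save characters.
    if host.startswith("www."):
        host = host[4:]
    # Tip 1: the path was already discarded by taking only the hostname.
    return host


def build_sites_param(sites):
    """Join shortened, de-duplicated hosts (comma separator is an assumption)."""
    hosts = []
    for site in sites:
        host = shorten_site(site)
        if host and host not in hosts:
            hosts.append(host)
    return ",".join(hosts)


print(build_sites_param([
    "http://www.foo.com/bar",
    "foo.com",
    "finance.yahoo.com/news",
    "sports.yahoo.com",
]))
```

Subdomains such as finance.yahoo.com and sports.yahoo.com are kept apart, since BOSS treats them as different sites.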

Dynamic “sites” Creation for Vertical Search

I recently gave a presentation in London with Skills Matter about Yahoo! BOSS. The small group was full of ideas about extending BOSS functionality. I wrote a new post for the Yahoo Developer Network that expands on some of these concepts: Make BOSS More Dynamic.

The post discusses the idea of generating the “sites” argument, which tells BOSS to limit the results to a specified list of web sites, dynamically for each query. This allows each query to determine what sites are experts and then create a result set based on those experts.
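
To make the idea concrete, here is one possible way to pick the "expert" sites for a query. This is only an illustrative sketch, not the prototype: it assumes you already have the result URLs from a first, unrestricted query, and it simply treats the hosts that appear most often as the experts. The function name is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def pick_expert_sites(result_urls, top_n=3):
    """Return the hosts that occur most often in a list of result URLs.

    Assumption: frequency in a first-pass result set stands in for expertise.
    """
    hosts = Counter()
    for url in result_urls:
        host = urlsplit(url).hostname
        if host:
            hosts[host] += 1
    return [host for host, _ in hosts.most_common(top_n)]


first_pass = [
    "http://a.com/1",
    "http://b.com/1",
    "http://a.com/2",
    "http://c.com/1",
    "http://b.com/2",
    "http://a.com/3",
]
# The chosen hosts would then be joined into the "sites" argument
# of a second query.
print(pick_expert_sites(first_pass, top_n=2))
```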

I have built a prototype and will release it this week, once I have had time to clean up some of the loose ends.

PopGist adds discussions to search

AltSearchEngines has a feature on the new search engine PopGist.

PopGist improves the relevancy of the existing search engines and provides better user search experience. The prototype in popgist.com demonstrates the power of PopGist technology by utilizing the search feed provided by Yahoo BOSS. Comparing with the original Yahoo search results, the overlap in the top ten results between Google and Yahoo significantly increases by using PopGist search ranking algorithm for most of the queries tested.
PopGist improves the relevancy of major search engines – AltSearchEngines

PopGist uses BOSS, but reranks the results and adds threads of discussions to give you a more rounded result page.

www2009 Madrid conference notes

I kept notes on various presentations during the www2009 conference. Unfortunately, they are rather sparse, but I wanted to do something with them before I completely forgot the context. So here are the rough notes. Think of it as liveblogging but not really live. :)

Socializing Big Data – Jeff Hammerbacher 4/22/09 12:20 PM

Hadoop is open to everyone and will play a central part in future development

Hadoop was the name of the developer’s son’s stuffed elephant toy

Written mostly in Java, inspired by Google infrastructure

  • plain architecture, no major hardware.
  • Resources: reddit, hacker news
  • Data sharing sites
    • Data.gov, many eyes, swivel, theinfo.org, infochimps, icharts
  • Hadoop is for offline processing
  • Amazon provides Hadoop connectivity

WebNC: efficient sharing of web apps 4/22/09 12:20 PM

WebNC – presented by Andreas Girgensohn of FX Palo Alto Laboratory

  • share a web browser window, conference calls. Like Adobe Connect
  • Requires Firefox for the presenter
  • HTML, JavaScript works on iPhone, non-flash

How it works

  1. The presenter’s computer has a Firefox extension that grabs the screen three times a second and breaks it into a grid of images.
  2. This is passed to the server, which stores the grid and any changes.
  3. The viewer’s extension rebuilds the grid to display the current view and any changes.
  4. There is some C++ logic to display native controls, such as form elements.
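
The grid-and-changes idea in steps 1 and 2 can be sketched in a few lines of Python. This is my own illustration, not WebNC code: a frame is modelled as rows of 0-255 pixel values, and hashing each tile to detect changes is an assumption about how changed tiles might be found.

```python
import hashlib


def split_into_tiles(frame, tile_h, tile_w):
    """Cut a frame (list of rows of 0-255 ints) into a grid of byte tiles."""
    tiles = {}
    for top in range(0, len(frame), tile_h):
        for left in range(0, len(frame[0]), tile_w):
            block = bytes(
                pixel
                for row in frame[top:top + tile_h]
                for pixel in row[left:left + tile_w]
            )
            tiles[(top // tile_h, left // tile_w)] = block
    return tiles


def changed_tiles(previous_hashes, frame, tile_h, tile_w):
    """Return only the tiles whose content differs from the last capture.

    previous_hashes is updated in place so the next capture compares
    against this one.
    """
    updates = {}
    for position, block in split_into_tiles(frame, tile_h, tile_w).items():
        digest = hashlib.sha1(block).hexdigest()
        if previous_hashes.get(position) != digest:
            previous_hashes[position] = digest
            updates[position] = block
    return updates


hashes = {}
frame = [[0] * 4 for _ in range(4)]
print(len(changed_tiles(hashes, frame, 2, 2)))   # first capture: every tile is new
frame[3][3] = 255                                # one pixel changes
print(list(changed_tiles(hashes, frame, 2, 2)))  # only the tile containing it
```

Only the tiles returned by changed_tiles would need to be sent to the server, which keeps the three captures a second cheap when most of the page is static.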

The Slashdot Zoo: Mining a Social Network with Negative Edges 4/22/09 12:20 PM

Jérôme Kunegis, Andreas Lommatzsch and Christian Bauckhage

  • social network with negative edges
  • multiplication rule applied to such networks
  • analysis at global, node, edge level
  • Slashdot allows you to assign a person as your friend, freak, or neutral
  • the enemy of my enemy is my friend
  • use this logic to predict trolls and popularity of freaks and friends.
  • Used PageRank and modified PageRank to predict the strength of a user
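
The multiplication rule from the notes above is easy to show with a toy example: multiply the signs along a two-step path, so that enemy × enemy = friend and friend × enemy = enemy. The sketch below is my own illustration with made-up users, not the authors’ code.

```python
def two_step_relations(signed_edges, user):
    """Infer relations two hops away by multiplying edge signs.

    signed_edges maps (source, target) to +1 (friend) or -1 (freak).
    Returns a score per indirect contact: positive suggests friend,
    negative suggests freak.
    """
    scores = {}
    for (source, middle), first_sign in signed_edges.items():
        if source != user:
            continue
        for (start, target), second_sign in signed_edges.items():
            if start != middle or target == user:
                continue
            # The multiplication rule: the sign of the path is the
            # product of the signs of its edges.
            scores[target] = scores.get(target, 0) + first_sign * second_sign
    return scores


edges = {
    ("alice", "bob"): -1,   # alice marks bob as a freak
    ("bob", "carol"): -1,   # bob marks carol as a freak
    ("alice", "dave"): +1,  # alice marks dave as a friend
    ("dave", "erin"): -1,   # dave marks erin as a freak
}
print(two_step_relations(edges, "alice"))
```

Here carol comes out positive (the enemy of my enemy) and erin negative (the enemy of my friend).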

Combining multi-level audio descriptors 4/22/09 17:20 PM

Jun Wang, Xavier Amatriain and David Garcia Garzon. Chinese Academy of Sciences, Telefonica Research, David Garcia Garzon

  • The CLAM framework
  • started in October 2000
  • framework for audio and music.
  • Object oriented and has been documented through a Pattern Language and DSL
  • Can be used for rapid prototyping and real time applications

Yahoo! Pipes like interface to analyze and modify music.

  • Audio visualizations display chords and their frequency
  • A semantic web crawler searches the web for information about a song
    • The extractor uses audio fingerprinting and metadata to identify songs on MusicBrainz (MBID)
    • It then outputs RDF statements linking local files with remote web identifiers
    • It extracts high level descriptors such as editorial metadata, user comments, reviews, tags…
  • An XML document is generated with all of the information produced by the service
  • Potential collaboration on annotating videos/television in real time

Microdonations for Stopping Spam 4/22/09 12:20 PM

Sharad Goel, Jake Hofman, John Langford, David Pennock and Daniel Reeves, Yahoo! Inc.

  • Domain filtering has been the most straightforward approach to combating spam. Reputable sources can also be whitelisted.
    • A secure hash also ensures reputable messages are not altered along the way
  • Content filtering looks at patterns and text within the message that signal spam.
    • Spammers are constantly trying to get around the filters.
  • Economic approaches: charge user in some way for sending a message. Charge via memory cycles, captcha, or monetary payments
    • What is the amount to pay? It has to be expensive for spammers but cheap enough for users
    • Computational costs are not a problem with spammers using zombie botnets
    • Human cycles, via captchas, are annoying and would cause people to go elsewhere
    • Early adopters are penalized for doing this, as the reduction will not be visible until a majority of people use it.
  • CentMail: users get stamps for sending messages.
    • Stamps promoted via signatures to let people know about the donations
  • Even limited adoption would help reduce the proliferation of spam. Spam filters will treat these stamps as a whitelist signal, so stamped messages will not end up in the junk folder.
  • Certify service defines the certification for the message that it is valid
  • Verify service recognizes the certificate.

Ranking for Search 4/22/09 2:20 PM

Editorial Judgements as truth

Pros

  • Control full process
  • Can calibrate judges to a consensus

Cons

  • Ownership of query

Position Bias

  • Position bias throws off clicks as a source of truth: what does no click mean?
  • An item that isn’t displayed cannot be clicked

Social Labeling

  • Social labeling game. Give people a boring task and make it fun to collect data.
  • Playing a game removes the no click problem. The users can also mark an item as bad.
  • Non-relevant content is not going to harm the brand as it is part of the game.
  • Preference models

Frequency

  • Pairwise probability model
  • Models pairwise interactions but not comparison set
  • Go Model
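
For a feel of what a pairwise probability model looks like, here is a deliberately simple sketch. It is not necessarily the model from the talk: it just estimates the probability that one item is preferred over another from the fraction of head-to-head comparisons it won. As the notes say, this captures pairwise interactions only, not the whole comparison set. The item names are made up.

```python
from collections import defaultdict


def pairwise_probabilities(preferences):
    """Estimate P(a preferred over b) from (winner, loser) observations."""
    wins = defaultdict(int)
    for winner, loser in preferences:
        wins[(winner, loser)] += 1

    probabilities = {}
    for (a, b), won in wins.items():
        # Total head-to-head comparisons between a and b, in either order.
        total = won + wins.get((b, a), 0)
        probabilities[(a, b)] = won / total
    return probabilities


observed = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]
for pair, p in pairwise_probabilities(observed).items():
    print(pair, round(p, 2))
```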

Compare vs. Editorial Judgments

  • How does the pairwise model compare against editorial judgments?
  • Use set of already collected editorial judgments for 60 of 427 queries.
  • The game data can be better than the judged values.

Relative judgments can be made more consistently than absolute judgments


Rated Aspect Summarization of Short Comments 4/22/09 12:20 PM

Yue Lu, ChengXiang Zhai and Neel Sundaresan

Opinions across the web typically have an overall rating

On eBay there is a feedback number, short phrases, and a rating of positive, negative, or neutral.

  • eBay presents the percentage of feedback as positive or negative
  • Distributions can also be seen. Ratings over a time span are important.
  • How can you tell why they are good, e.g. shipping, quality, service…

eBay wants to break this rating down into its various components. This would be generated from the large number of comments

Challenges

  • How to identify coherent aspects?
  • How to accurately rate each aspect?
  • How to get meaningful phrases supporting the ratings?

eBay overall approach

  • Aspect and discovery (fast + shipping) = positive. Head term clustering.
  • Combine people that say fast + shipping
  • Unstructured PLSA: assign high probability to modifiers i.e. fast and shipping
  • Structured PLSA: create configuration file for each head term and the strengths associated when combined with a modifier
    • A modifier can be associated with multiple head terms, i.e. fast (shipping, delivery, communications)

Mining the Web 2.0 for Better Search 4/22/09 12:20 PM

Dr. Ricardo Baeza-Yates, Yahoo! Research, Barcelona, Spain

Amount of content on the web (2007)

  • 8-10 GB user generated content/day
  • 3-4 GB published content
  • private text content 3 TB

Examples

Explicit

  • metadata
  • rdf
  • Wikipedia = UGC
  • Open Data Project = UGC
  • Yahoo Answers = UGC
  • Flickr = UGC

Implicit

  • text = UGC
  • anchors and links
  • queries and clicks – private

Crowdsourcing – wisdom of crowds

  • importance of diversity, independence and decentralization – aggregating data
  • popularity
  • diversity – long tail
  • quality
  • coverage

Searchpad

  • Detects a trend in search queries and produces an online page for later reference
