
I Learned It the Hard Way: Observations about Search Interface Design and Evaluation




  1. I Learned It the Hard Way: Observations about Search Interface Design and Evaluation Marti Hearst UC Berkeley

  2. Outline • Why is Supporting Search Difficult? • What Works? • How to Evaluate?

  3. Search Interface Evaluation • Timing Data • Matching Users to Tasks • Spool’s Treasure Hunt Technique

  4. Highly Motivated Participants • Jared Spool makes this claim

  5. Fancy Often Fails

  6. Use Topic-Matched Users

  7. Timing
  • Information-intensive interfaces are very sensitive to:
    • Task effects
      • Match of task to search results
      • Participants’ familiarity with task topic
    • Task difficulty
      • In general
      • With respect to this system
    • Individual differences
      • Reading ability
      • Reading style (scan vs. read thoroughly)
      • General knowledge and reasoning strategies (CHI Browse-Off)
      • Spatial ability
  • Timing isn’t everything
    • Subjective assessment
    • Return usage
    • Longitudinal studies are often quite revealing
    • Browsing interfaces: longer can be better

  8. Cool Doesn’t Cut It
  • It’s very difficult to design a search interface that users prefer over the standard
  • Some ideas have a strong WOW factor
    • Examples: Kartoo, Groxis, the hyperbolic tree
    • But they don’t pass the “will you use it” test
  • Even some simpler ideas fall by the wayside
    • Example: visual ranking indicators for results-set listings

  9. Early Visual Rank Indicators

  10. Metadata Matters • When used correctly, text that describes text, images, video, etc. works well • “Searchers” often turn into “browsers” when given appropriate links • However, metadata has many perils • The Kosher Recipe Incident

  11. Small Details Matter
  • UIs for search especially require great care in small details
    • In part due to the text-heavy nature of search
  • There is a tension between adding more information and introducing clutter
  • How and where to place things is important
    • People tend to scan or skim
    • Only a small percentage reads instructions

  12. Small Details Matter
  • UIs for search especially require endless tiny adjustments
    • In part due to the text-heavy nature of search
  • Example: in an earlier version of the Google spellchecker, people didn’t always see the suggested correction
    • A long sentence was used at the top of the page: “If you didn’t find what you were looking for …”
    • People complained that they got results, but not the right results
    • In reality, the spellchecker had suggested an appropriate correction
  • Interview with Marissa Mayer by Mark Hurst: http://www.goodexperience.com/columns/02/1015google.html

  13. Small Details Matter
  • The fix: analyzed logs and saw that people didn’t see the correction; they
    • clicked on the first search result,
    • didn’t find what they were looking for (came right back to the search page),
    • scrolled to the bottom of the page and did not find anything,
    • and then complained directly to Google
  • The solution was to repeat the spelling suggestion at the bottom of the page
  • More adjustments: the message is shorter, and different on the top vs. the bottom
  • Interview with Marissa Mayer by Mark Hurst: http://www.goodexperience.com/columns/02/1015google.html
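To make the fix concrete, here is a minimal sketch of the placement idea the slide describes. This is not Google’s actual code; the function name, markup, and messages are illustrative assumptions. The suggestion is rendered both above and below the results, shorter and scannable at the top, worded differently at the bottom.

```python
from typing import List, Optional

def render_results_page(query: str, results: List[str],
                        suggestion: Optional[str]) -> str:
    """Assemble a results page; `suggestion` comes from a spellchecker."""
    parts = []
    if suggestion:
        # Top message: short and scannable -- people skim rather than read.
        parts.append(f'Did you mean: <a href="/search?q={suggestion}">'
                     f'<b>{suggestion}</b></a>?')
    parts.extend(f'<p>{r}</p>' for r in results)
    if suggestion:
        # Bottom message: repeated with different wording, to catch users
        # who scrolled through the results without finding what they wanted.
        parts.append(f'No luck? Try your search again as '
                     f'<a href="/search?q={suggestion}"><b>{suggestion}</b></a>.')
    return "\n".join(parts)

print(render_results_page("information retreival",
                          ["result 1 ...", "result 2 ..."],
                          suggestion="information retrieval"))
```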

  14. Small Details Matter • Layout, font, and whitespace for information-centric interfaces require very careful design • Examples: • Photo thumbnails • Search results summaries

  15. Searching Earthquakes at UCB: Standard Way

  16. Searching Earthquakes at UCB with Cha-Cha

  17. Query: Seaborg

  18. Query: “Phase II”

  19. TileBars • Graphical Representation of Term Distribution and Overlap • Simultaneously Indicate: • relative document length • query term frequencies • query term distributions • query term overlap
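As a rough illustration of the idea, here is a toy ASCII rendering of a TileBar of my own devising, not Hearst’s original implementation: one row per query term, one column per document segment, darker glyphs for higher within-segment term frequency, and overall bar width reflecting document length.

```python
GLYPHS = " .:#"  # frequency 0, low, medium, high

def tilebar(segments, query_terms):
    """Render one row per query term, one cell per document segment."""
    rows = []
    for term in query_terms:
        cells = []
        for seg in segments:
            freq = seg.lower().split().count(term.lower())
            cells.append(GLYPHS[min(freq, len(GLYPHS) - 1)])
        rows.append(f"{term:>12} |{''.join(cells)}|")
    return "\n".join(rows)

# Hypothetical document, pre-split into segments:
doc_segments = ["the dbms logs all transactions",
                "reliability of the dbms dbms core",
                "banking industry overview",
                "reliability reliability failover testing"]
print(tilebar(doc_segments, ["DBMS", "reliability"]))
# Bar width shows document length; dark cells cluster where a term is
# concentrated, so term distribution and overlap are visible at a glance.
```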

  20. Query terms: what roles do they play in retrieved documents?
  [Figure: TileBars for documents retrieved with the query terms DBMS (database systems) and Reliability, distinguishing documents that are mainly about DBMS & reliability; mainly about DBMS, with a discussion of reliability; mainly about banking, with a subtopic discussion of DBMS/reliability; and mainly about high-tech layoffs]

  21. Pagis Pro 97

  22. Pagis Pro 97
  • Received three prestigious industry editorial awards
    • Windows Magazine’s Win 100 award
    • Home PC selected Pagis Pro for its annual Hit Parade
    • PC Computing rated Pagis Pro over competitors
  • Version 2.0: MatchBars dropped!
    • “Our usability testing found that people didn't understand the simple 3-term/3-bar implementation.”
    • Replaced with a simple bar whose length is related to the score returned by Verity
  • I believe the problem was in our reduced implementation, and not in the fundamental idea

  23. Why is Supporting Search Difficult? • Everything is fair game • Abstractions are difficult to represent • The vocabulary disconnect • Users’ lack of understanding of the technology • Clutter vs. Information

  24. Everything is Fair Game • The scope of what people search for is all of human knowledge and experience. • Other interfaces are more constrained (word processing, formulas, etc.) • Interfaces must accommodate human differences in: • Knowledge / life experience • Cultural background and expectations • Reading / scanning ability and style • Methods of looking for things (pilers vs. filers)

  25. Abstractions Are Hard to Represent
  • Text describes abstract concepts
  • It is difficult to show the contents of text in a visual or compact manner
  • Exercise:
    • How would you show the preamble of the US Constitution visually?
    • How would you show the contents of Joyce’s Ulysses visually? How would you distinguish it from Homer’s The Odyssey or McCourt’s Angela’s Ashes?
  • The point: it is difficult to show text without using text

  26. Vocabulary Disconnect • If you ask a set of people to describe a set of things, there is little overlap in the results.

  27. The Vocabulary Problem
  Data sets examined (and number of participants):
  • Main verbs used by typists to describe the kinds of edits that they do (48)
  • Commands for a hypothetical “message decoder” computer program (100)
  • First word used to describe 50 common objects (337)
  • Categories for 64 classified ads (30)
  • First keywords for each of a set of recipes (24)
  Furnas, Landauer, Gomez, Dumais: The Vocabulary Problem in Human-System Communication. Commun. ACM 30(11): 964-971 (1987)

  28. The Vocabulary Problem
  These are really bad results
  • If one person assigns the name, the probability of it NOT matching another person’s is about 80%
  • What if we pick the most commonly chosen words as the standard? Still not good.
  Furnas, Landauer, Gomez, Dumais: The Vocabulary Problem in Human-System Communication. Commun. ACM 30(11): 964-971 (1987)
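To see why spontaneous agreement is so low, here is a back-of-the-envelope sketch: if two people choose names independently from the same empirical distribution, the chance they agree is the sum of the squared name probabilities. The response list below is hypothetical, not data from Furnas et al.

```python
from collections import Counter

def agreement_probability(names):
    """P(two independent people pick the same name) = sum of p_i^2."""
    counts = Counter(names)
    n = len(names)
    return sum((c / n) ** 2 for c in counts.values())

# Hypothetical names ten people might give one object:
responses = ["couch", "sofa", "couch", "settee", "sofa", "couch",
             "loveseat", "couch", "sofa", "divan"]
print(f"P(two people agree) = {agreement_probability(responses):.2f}")
# -> 0.28 for this toy spread; the spreads observed in the actual study
#    were wider still, which is how the non-match rate reaches ~80%.
```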

  29. Lack of Technical Understanding • Most people don’t understand the underlying methods by which search engines work.

  30. People Don’t Understand Search Technology
  A study of 100 randomly chosen people found:
  • 14% never type a URL directly into the address bar
  • Several tried to use the address bar, but did it wrong
    • Put spaces between words, or combinations of dots and spaces: “nursing spectrum.com”, “consumer reports.com”
    • Several used the search form with no spaces: “plumber’slocal9”, “capitalhealthsystem”
  • People do not understand the use of quotes
    • Only 16% use quotes, and of these, some use them incorrectly (e.g., around all of the words, making results too restrictive)
    • “lactose intolerance –recipies”: here the – excludes the (misspelled) recipes
  • People don’t make use of “advanced” features
    • Only 1 used “find in page”
    • Only 2 used the Google cache
  Hargittai, Classifying and Coding Online Actions, Social Science Computer Review 22(2), 2004, 210-227.
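As an illustration of the syntax these participants were misusing, here is a toy query parser of my own (not from the study, and not any real engine’s grammar) showing how a typical engine treats quoted strings as phrases and a leading minus as exclusion:

```python
import re

def parse_query(q: str):
    """Split a query into (phrases, required, excluded) terms."""
    phrases = re.findall(r'"([^"]+)"', q)
    rest = re.sub(r'"[^"]+"', " ", q).split()
    excluded = [t[1:] for t in rest if t.startswith("-")]
    required = [t for t in rest if not t.startswith("-")]
    return phrases, required, excluded

print(parse_query('lactose intolerance -recipies'))
# -> ([], ['lactose', 'intolerance'], ['recipies'])
#    The minus EXCLUDES pages mentioning "recipies" -- the opposite of
#    what a user hunting for recipes intended.
print(parse_query('"nursing spectrum.com"'))
# -> (['nursing spectrum.com'], [], [])  quoted as a phrase, not a URL
```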

  31. People Don’t Understand Search Technology
  Without appropriate explanations, most of 14 people had strong misconceptions about:
  • ANDing vs. ORing of search terms
    • Some assumed the ANDing search engine indexed a smaller collection; most had no explanation at all
  • Empty results for the query “to be or not to be”
    • 9 of 14 could not give an explanation that remotely resembled stop word removal
  • Term order variation: “boat fire” vs. “fire boat”
    • Only 5 out of 14 expected different results
    • Understanding was vague, e.g.: “Lycos separates the two words and searches for the meaning, instead of what you’re looking for. Google understands the meaning of the phrase.”
  Muramatsu & Pratt, “Transparent Queries: Investigating Users’ Mental Models of Search Engines,” SIGIR 2001.
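A minimal sketch of the two behaviors the participants could not explain, stop word removal and AND vs. OR semantics, over a toy inverted index; the stop list and postings below are made up for illustration:

```python
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of"}

index = {  # toy inverted index: term -> set of doc ids
    "boat": {1, 2},
    "fire": {2, 3},
    "hamlet": {4},
}

def search(query: str, mode: str = "AND") -> set:
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    if not terms:
        return set()  # every term was a stop word -> empty results
    postings = [index.get(t, set()) for t in terms]
    return (set.intersection(*postings) if mode == "AND"
            else set.union(*postings))

print(search("to be or not to be"))     # set(): all terms were stop words
print(search("boat fire", mode="AND"))  # {2}
print(search("boat fire", mode="OR"))   # {1, 2, 3}
# A bag-of-words engine returns the same sets for "fire boat" -- which is
# why only some users expected term order to matter.
```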

  32. What Works?

  33. What Works for Search Interfaces?
  • Query term highlighting
    • in results listings
    • in retrieved documents
  • Sorting of search results according to important criteria (date, author)
  • Grouping of results according to well-organized category labels (see Flamenco)
  • DWIM, but only if highly accurate:
    • Spelling correction/suggestions
    • Simple relevance feedback (more-like-this)
    • Certain types of term expansion
  • So far: not really visualization
  Hearst et al.: Finding the Flow in Web Site Search, CACM 45(9), 2002.

  34. Highlighting Query Terms • Boldface or color • Adjacency of terms with relevant context is a useful cue.
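A minimal sketch of query-term highlighting in a result snippet, assuming simple case-insensitive whole-word matching; the boldface markup and function name are illustrative:

```python
import re

def highlight(snippet: str, query_terms):
    """Wrap whole-word, case-insensitive query term hits in <b> tags."""
    for term in query_terms:
        snippet = re.sub(rf"\b({re.escape(term)})\b", r"<b>\1</b>",
                         snippet, flags=re.IGNORECASE)
    return snippet

print(highlight("Earthquake safety at UC Berkeley campus buildings",
                ["earthquake", "berkeley"]))
# -> <b>Earthquake</b> safety at UC <b>Berkeley</b> campus buildings
```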

  35. [Figure: highlighted query term hits using the Google toolbar for the queries “Microsoft”, “US Blackout”, and “PGA”; some hits were found, others were not (“don’t know”)]

  36. How to Introduce New Features? • Example: Yahoo “shortcuts” • Search engines now provide groups of enriched content • Automatically infer related information, such as sports statistics • Accessed via keywords • User can quickly specify very specific information • united 570 (flight arrival time) • map “san francisco” • We’re heading back to command languages!
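A sketch of how such keyword-triggered shortcuts might be dispatched. The trigger patterns below are invented for illustration; the real Yahoo and Google rules are not documented here.

```python
import re

# Made-up trigger rules: (pattern, handler producing enriched content).
SHORTCUTS = [
    (re.compile(r"^(united|delta|aa)\s+(\d+)$", re.IGNORECASE),
     lambda m: f"flight status card: {m.group(1).upper()} {m.group(2)}"),
    (re.compile(r'^map\s+"?([^"]+?)"?$', re.IGNORECASE),
     lambda m: f"map panel for {m.group(1)}"),
]

def dispatch(query: str) -> str:
    """Return enriched content if a shortcut fires, else fall through."""
    for pattern, handler in SHORTCUTS:
        m = pattern.match(query.strip())
        if m:
            return handler(m)
    return "ordinary web results"

print(dispatch("united 570"))           # flight status card: UNITED 570
print(dispatch('map "san francisco"'))  # map panel for san francisco
```

The keyword triggers are exactly what the slide calls a return to command languages: terse, powerful, and invisible until the user learns them.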

  37. Introducing New Features • A general technique: scaffolding • Scaffolding: • Facilitate a student’s ability to build on prior knowledge and internalize new information. • The activities provided in scaffolding instruction are just beyond the level of what the learner can do already. • Learning the new concept moves the learner up one “step” on the conceptual “ladder”

  38. Scaffolding Example • The problem: how do people learn about these fantastic but unknown options? • Example: scaffolding the definition function • Where to put a suggestion for a definition? • Google used to simply hyperlink it next to the statistics for the word. • Now a hint appears to alert people to the feature.
