Search In Small World Networks Presented by George Frederick
Searching the World Wide Web Steve Lawrence C. Lee Giles
Overview • Research performed in 1998, so results are dated • The Web is constantly growing and changing, making size estimation difficult • Search engine companies claimed they could keep up • Tested actual search engine coverage
Overview • Typical coverage tests performed by checking number of results returned • Algorithm may not require exact search term matches (related terms) • Documents may no longer exist, meaning engines with stale data have an advantage • Documents may have been altered, changing their relevance
Market Share • Selberg and Etzioni attempted to calculate the “market share” of search engines using their MetaCrawler aggregate search service • Calculated as the percentage of documents users followed through from each search engine
Market Share According to Selberg and Etzioni in 1997
Market Share • Drawbacks to calculation method • Difficult for users to determine relevance without clicking through and examining pages first • User relevance judgments are biased by presentation order
Web Coverage • Selberg and Etzioni also attempted to calculate Web coverage for each search engine • Flawed because MetaCrawler only retrieves the first few results from each engine • Engines may return unique documents for the first few results while the rest are the same • Engines may return the same documents for the first few results while the rest are unique
Web Coverage • Lawrence and Giles made their own attempt at calculating search engine Web coverage • Analyzed AltaVista, Excite, HotBot, Infoseek, Lycos, and Northern Light • Google was not publicly available at the time of the study • Commonly believed that the engines indexed roughly the same material and that each covered most of the Web
Data Gathering • Collected search engines’ responses to queries by NEC Research Institute employees over several days • Retrieved every matching index entry and the corresponding document for each query in order to count them • All index entries needed to avoid bias from result ranking • All documents needed to ensure they still existed and had not been altered since indexing
Data Gathering • Duplicate documents not counted twice, even with different URLs • Only lowercase queries considered • Pages not displayed within a minute were not counted • Considered only queries returning up to 600 results collectively • Only exact search terms were counted • 575 queries analyzed in total
Sources of Bias • Size of the Web is estimated from the overlap in search engine coverage • Biased because not all pages can be indexed • The set of all pages that can be indexed by search engines is referred to as the “indexable Web”
Sources of Bias • Pages are often manually registered with several different search engines, meaning the engines’ indices are not independent random samples • Pages that are highly linked to by other pages are more likely to be indexed
Methodology • Authors expected that larger engines have lower dependence • Don’t rely as heavily on user submitted indices • Can crawl and find less popular pages • Assumption is that the larger an engine is, the more accurate an estimate it can provide of the size of the Web
Methodology • Analyzed overlap between the largest engines (AltaVista and HotBot) • Estimated that the “indexable Web” has a lower bound of 320 million pages • Common earlier estimates ranged from only 75 to 200 million • Authors assert that these were significant underestimations
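The overlap analysis above can be illustrated with a minimal sketch. It assumes a standard capture-recapture style estimate (the slides do not spell out the exact formula), and the function name, parameter names, and numbers are hypothetical. If the two engines are positively dependent, the overlap is inflated and the result is better read as a lower bound.

```python
def estimate_web_size(size_a, docs_b, docs_in_both):
    """Estimate indexable-Web size from the overlap of two engines.

    size_a       -- number of pages engine A is assumed to index
    docs_b       -- documents returned by engine B across the test queries
    docs_in_both -- how many of those documents engine A also returned
    """
    if docs_in_both == 0:
        raise ValueError("no overlap: the estimate is undefined")
    # The fraction of B's documents that A also holds approximates
    # A's coverage of the indexable Web, assuming independent engines.
    coverage_a = docs_in_both / docs_b
    return size_a / coverage_a


# Made-up numbers, for illustration only (not figures from the paper).
print(estimate_web_size(size_a=100_000_000, docs_b=1_000, docs_in_both=250))
```

With these illustrative inputs, engine A covers a quarter of engine B's documents, so the estimate is four times A's index size.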
Size Estimates Estimated indexable Web size
Size Estimates Percentage of indexable Web each search engine covers
Additional Analysis Percentage of invalid links
Additional Analysis • Median age of documents • Results suggest that engines with the most recent pages don’t necessarily have the best coverage • Tradeoff between database size and update frequency
Results • Search engine coverage varies by an order of magnitude • Indexable Web estimated to have a lower bound of 320 million pages • Engines index only a fraction of the Web • Individual engines each covered between 3% and 34% of the indexable Web
Conclusion • Combining results of multiple search engines significantly increases the number of unique results • Search aggregators would significantly increase the quality of search results
The Small-World Phenomenon and Decentralized Search Jon Kleinberg
Small World Search • Related to gossip algorithms in that each node works with only local knowledge but the network presents emergent behavior • Watts-Strogatz model involves a d-dimensional lattice with uniformly random shortcuts • In that model it can be proved that no decentralized search using only local knowledge can find short paths
Small World Search • A subtle variation draws shortcuts with a probability that decays like the inverse d’th power of their distance (in d dimensions), which supports efficient search • i.e. a node is approximately as likely to create shortcuts at distances 1 to 10 as it is at distances 10 to 100, 100 to 1000, etc.
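The "equally likely at every distance scale" property can be checked numerically. The sketch below is an illustration under a stated approximation: in d dimensions roughly k^(d-1) lattice nodes sit at distance k, and each is chosen with weight k^(-d), so distance k carries total weight of about 1/k. The function name is hypothetical.

```python
def scale_mass(lo, hi, d=1):
    """Unnormalised probability that a shortcut spans a distance in [lo, hi).

    Approximation: about k**(d-1) nodes lie at lattice distance k, and
    each one is picked with weight k**(-d).
    """
    return sum(k ** (d - 1) * k ** (-d) for k in range(lo, hi))


# Each decade of distance receives a comparable share of the shortcuts.
for lo, hi in [(1, 10), (10, 100), (100, 1000)]:
    print(f"distances {lo} to {hi}: weight {scale_mass(lo, hi):.2f}")
```

The three weights come out close to one another, whereas uniformly random shortcuts would put almost all of the weight on the longest distances.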
Small World Search A node with several random shortcuts spanning different distance scales
Small World Search Construction of networks based on Watts-Strogatz variation has been successfully employed in peer-to-peer file-sharing systems and on the Internet
Identity and Search In Social Networks Duncan J. Watts Peter Sheridan Dodds M. E. J. Newman
Overview Proposes model that defines a class of searchable networks and a method for searching them that is applicable to many network search problems
Searchability • Searchability is defined as the property of being able to find a target quickly • Searchability has been shown to exist in scale-free and lattice networks, but neither is a satisfactory model of society
Social Network Model Authors assert that proposed model is based on plausible social structures Follows naturally from six contentions about social networks
1. Identities • Nodes in social networks have identities in addition to relationships • Identities are defined as sets of characteristics that individuals attribute to themselves and others through their association with social groups • A group is defined as a collection of nodes with a well-defined set of social characteristics
2. Hierarchical View • Individuals break down the world into a hierarchy of more and more specific layers • Top layer is the world • Bottom layer is the individual • In practice, individuals don’t usually go all the way down but stop at a cognitively manageable layer • A reasonable upper bound on group size is g = 100
2. Hierarchical View • The similarity x_ij between individuals i and j is the height of their lowest common ancestor level in this hierarchy • x_ij = 1 if i and j are in the same group • Hierarchies are defined to have a depth l and branching ratio b • Purely a cognitive construct for measuring social distance, not an actual network
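As a minimal sketch of the similarity measure, assume the bottom-level groups are numbered 0, 1, 2, ... from left to right in a tree with branching ratio b, so that integer division by b gives a group's parent. That numbering and the function name are assumptions for illustration. Note that smaller values mean more similar individuals.

```python
def lca_height(group_i, group_j, b):
    """Height of the lowest common ancestor of two bottom-level groups.

    Returns 1 when both individuals are in the same group and grows by
    one for every level that must be climbed before the branches meet.
    """
    x = 1
    while group_i != group_j:
        group_i //= b  # move up to the parent branch
        group_j //= b
        x += 1
    return x


# Depth l = 3, branching ratio b = 2: four bottom-level groups (0..3).
print(lca_height(0, 0, b=2))  # same group
print(lca_height(0, 1, b=2))  # sibling groups
print(lca_height(0, 3, b=2))  # meet only at the top level
```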
3. Homophily • The more similar individuals are, the more likely it is that they know each other • To construct the network, randomly choose a node i and a link distance x with probability p(x) = c exp(-αx) • α is a tunable parameter (measure of homophily) • c is a normalizing constant
3. Homophily • Choose a second node j uniformly from all nodes at distance x from i • Repeat until the nodes have an average of z friends
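The construction steps can be sketched for a single hierarchy as follows. This is an illustrative reading of the slides, not the authors' code: the node and group numbering, the brute-force candidate search, and all parameter values are assumptions.

```python
import math
import random


def lca_height(gi, gj, b):
    """Height of the lowest common ancestor of two bottom-level groups."""
    x = 1
    while gi != gj:
        gi //= b
        gj //= b
        x += 1
    return x


def build_network(l, b, g, alpha, z, seed=0):
    """Link nodes until the average number of friends reaches z.

    l, b -- depth and branching ratio of the hierarchy
    g    -- individuals per bottom-level group
    Very large alpha combined with z >= g can make this loop run for a
    very long time, because long links are then almost never drawn.
    """
    rng = random.Random(seed)
    n = g * b ** (l - 1)                      # total population
    group = [node // g for node in range(n)]  # group of each node
    distances = list(range(1, l + 1))
    # p(x) = c * exp(-alpha * x); rng.choices normalises, which supplies c.
    weights = [math.exp(-alpha * x) for x in distances]
    edges = set()
    while 2 * len(edges) / n < z:
        i = rng.randrange(n)
        x = rng.choices(distances, weights=weights)[0]
        candidates = [j for j in range(n)
                      if j != i and lca_height(group[i], group[j], b) == x]
        if candidates:
            j = rng.choice(candidates)
            edges.add((min(i, j), max(i, j)))  # undirected, no duplicates
    return n, edges


n, edges = build_network(l=4, b=2, g=5, alpha=1.0, z=4)
print(n, "nodes,", len(edges), "links")
```

Storing each link as an ordered pair in a set keeps the graph undirected and ignores repeated draws of the same pair.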
3. Homophily • When e^(-α) ≪ 1, all links will be as short as possible, meaning individuals only have connections to those most similar to themselves, forming many isolated cliques • When e^(-α) = b, individuals are equally likely to be linked with any other individuals, resulting in a uniform random graph
4. Multiple Hierarchies • Individuals split the world into multiple hierarchies dependent upon context, e.g. geography or occupation • These represent different societal dimensions • A node’s identity is defined as an H-dimensional coordinate vector v_i, where v_i^h is the position of node i in the h’th hierarchy/dimension
4. Multiple Hierarchies • Each node i is randomly assigned a coordinate in each of the H hierarchies and allocated friends as previously described, randomly choosing a hierarchy h for each link • When H = 1 and e^(-α) ≪ 1, the link density must obey the constraint z < g
5. Social Distance • Individuals have their own perception of “social distance” y_ij = min_h x_ij^h • Close proximity in just one hierarchy is sufficient to connote affiliation • Violates the triangle inequality • Individuals i and j can be close in one hierarchy and individuals j and k close in another, but i and k may still be far apart in both hierarchies
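A small worked example makes the triangle-inequality violation concrete. It assumes each identity is a tuple of bottom-level group indices, one per hierarchy, numbered so that integer division by b gives the parent branch. The three individuals and the function names are hypothetical.

```python
def lca_height(gi, gj, b):
    """Height of the lowest common ancestor of two bottom-level groups."""
    x = 1
    while gi != gj:
        gi //= b
        gj //= b
        x += 1
    return x


def social_distance(v_i, v_j, b):
    """y_ij: the smallest hierarchical distance over all hierarchies."""
    return min(lca_height(a, c, b) for a, c in zip(v_i, v_j))


# H = 2 hierarchies, each with branching ratio 2 and four groups (0..3).
i = (0, 3)  # shares a group with j in the first hierarchy
j = (0, 0)
k = (3, 0)  # shares a group with j in the second hierarchy

y_ij = social_distance(i, j, b=2)
y_jk = social_distance(j, k, b=2)
y_ik = social_distance(i, k, b=2)
print(y_ij, y_jk, y_ik)
print("triangle inequality violated:", y_ik > y_ij + y_jk)
```

Here i and k are each at distance 1 from j, yet at distance 3 from each other, so y_ik exceeds y_ij + y_jk.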
6. Local Knowledge • Each individual has only local information • Its own coordinate vector v_i • Its neighbors’ coordinate vectors v_j • Its target’s coordinate vector v_t • Two kinds of partial information are therefore known: social distance and network paths • Neither alone is sufficient for efficient searching • Combined, they are sufficient
6. Local Knowledge • Search follows Milgram’s experiment: each node i forwards the message to a single neighbor j that it perceives to be closer to the target t
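The forwarding rule can be sketched as a greedy search that uses only the local information listed above. This sketch assumes the holder picks the neighbor with the smallest social distance to the target and hands the message straight to the target when it is a neighbor. The data layout, the `max_hops` guard, and the function names are assumptions for illustration.

```python
import random


def lca_height(gi, gj, b):
    """Height of the lowest common ancestor of two bottom-level groups."""
    x = 1
    while gi != gj:
        gi //= b
        gj //= b
        x += 1
    return x


def social_distance(v_i, v_j, b):
    """Smallest hierarchical distance over all hierarchies."""
    return min(lca_height(a, c, b) for a, c in zip(v_i, v_j))


def forward_message(neighbors, coords, source, target, b,
                    p=0.0, max_hops=100, rng=None):
    """Return the chain of nodes from source to target, or None if lost.

    p is the probability that the chain terminates at each hop.
    """
    rng = rng or random.Random(0)
    path = [source]
    current = source
    while current != target:
        if len(path) - 1 >= max_hops or rng.random() < p:
            return None                    # chain abandoned
        friends = neighbors[current]
        if not friends:
            return None                    # dead end
        if target in friends:
            current = target
        else:
            # Greedy step based only on coordinates the holder knows.
            current = min(friends, key=lambda nb: social_distance(
                coords[nb], coords[target], b))
        path.append(current)
    return path


# Tiny example: one hierarchy (H = 1), branching ratio 2.
coords = {"A": (0,), "B": (1,), "C": (3,)}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(forward_message(neighbors, coords, "A", "C", b=2))
```

The `max_hops` guard is there because a purely greedy walk can cycle between nodes that are equally close to the target.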
Modeling Milgram • Principal objective is to determine whether the average length <L> of a message chain between a randomly selected sender s and target t is small • Small has previously been defined to mean that <L> grows slowly with population size N • This is insufficient because each hop terminates the chain with probability p = 0.25 • An absolute bound is required
Modeling Milgram • A searchable network is defined as one in which the probability q of successful delivery is at least some fixed value r • In terms of chain length, the authors formally require q = <(1-p)^L> ≥ r • This gives a maximum required <L> ≤ ln r / ln(1-p) • For experimental purposes, the authors set r = 0.05 and p = 0.25, which requires <L> ≤ 10.4, independent of population size N
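The bound on chain length can be reproduced directly from the stated values of r and p. The function name is hypothetical.

```python
import math


def max_chain_length(r, p):
    """Largest average chain length satisfying (1 - p)**L >= r."""
    return math.log(r) / math.log(1 - p)


bound = max_chain_length(r=0.05, p=0.25)
print(round(bound, 1))  # 10.4
```

Both logarithms are negative, so dividing by ln(1-p) flips the inequality and yields an upper bound on <L>.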
Conclusion • By setting the parameters in accordance with the six sociological contentions, the authors are able to reproduce Milgram’s results in simulation • The concept is general enough to be applied to many types of networks beyond social ones, such as p2p networks, the Web, and citation networks
Conclusion The multi-dimensional aspect of the model makes decentralized database organization and searching more efficient with simple, greedy algorithms
Papers • S. Lawrence and C. L. Giles. Searching the World Wide Web. Science 280(5360):98-100, 1998. http://citeseer.ist.psu.edu/lawrence98searching.html • J. Kleinberg. The Small-World Phenomenon and Decentralized Search. SIAM News 37(3), April 2004. • D. J. Watts, P. S. Dodds, M. E. J. Newman. Identity and Search in Social Networks. Science 296(5571), 2002.