{"id":1881,"date":"2013-07-26T07:45:51","date_gmt":"2013-07-26T07:45:51","guid":{"rendered":"http:\/\/peerproduction.net\/?page_id=1881"},"modified":"2013-07-31T15:09:08","modified_gmt":"2013-07-31T15:09:08","slug":"p2p-search-as-an-alternative-to-google-recapturing-network-value-through-decentralized-search","status":"publish","type":"page","link":"http:\/\/peerproduction.net\/editsuite\/issues\/issue-3-free-software-epistemics\/peer-reviewed-papers\/p2p-search-as-an-alternative-to-google-recapturing-network-value-through-decentralized-search\/","title":{"rendered":"P2P Search as an Alternative to Google: Recapturing network value through decentralized search"},"content":{"rendered":"
by Tyler Handley<\/span><\/strong><\/p>\n <\/p>\n <\/span><\/p>\n Defining Google as the most visited website in the world fails to give its prevalence the recognition it deserves. As a search engine, its role in our digital lives is of paramount importance. It is the first place many of us look for information on everything from healthcare to shopping. Following from this, how Google presents us with information is of great importance to the fundamental structure of information online. Along with its notable successes, it also presents its users with well defined problems of information diversity, autonomy, and privacy, all of which stem from various censorship and filtering practices.<\/span><\/p>\n In this paper I will approach these problems and propose P2P search as a conceivable alternative. Section 1.0 will focus on the history of the search engine, its importance in the Networked Information Economy, and how Google has capitalized upon the internet\u2019s wealth for monetary gain. Section 2.0 will focus on the various practices of censorship and filtering that are technologically endemic to Google\u2019s ranking mechanisms and central server approach to search. Section 3.0 will analyze alternatives to Google search, lay out five criteria for a search engine to solve Google\u2019s stated problems, and then apply the P2P search engine YaCy to these five criteria to discern its efficacy in alleviating them.<\/span><\/p>\n \n \n The original intent of the various packet-switching networks that came to form the Internet was to share information. The more people used the Internet to share, the more information became available. Unfortunately for users, the Internet lacked (and still lacks) a built in Information Retrieval System (IRT), making information location difficult. Archie was the first attempt at making information online accessible. 
It visited existing FTP servers and indexed the titles of all of the files, allowing people to find particular documents more easily (Halavais, 2008, pp. 21). It did so by utilizing a basic keyword index. Gopher was a more advanced attempt to make FTP servers more accessible; it organized files into hierarchical categories (Halavais, 2008, pp. 22).<\/span><\/p>\n The idea of organizing information online hierarchically in categories carried over into the browser-centric environment of the World Wide Web (WWW) through Yahoo!. It relied on human agents to scour the web and then organize its information into themed categories. Such a system may have dealt with the amount of pre-WWW information on the internet in a useful way, but the commercialization of the WWW, especially after the release of the popular Mosaic browser in 1993, brought plenty more users online to share information. In 1994, Yahoo! \u201c… quickly ran into deep problems, both in terms of scale (impossibility to keep up with the growth of the Web) and ontology (the categorical system could not contain the complexity and dynamism of the information space it claimed to organize)\u201d (Feuze, Fuller and Stalder, 2011). Since the architecture of the WWW allowed anyone with a connection to post information, Yahoo! became inundated with new information on a scale not amenable to its search system. 1995 saw the release of AltaVista, a \u201cfaster and more comprehensive\u201d search engine that crawled the web with a spider and automatically indexed the information, as opposed to Yahoo!, which manually indexed links with human agents (Feuze, Fuller and Stalder 2011). Doing an automated web crawl allowed AltaVista to catalogue and index a much wider range of links \u2013 a process that was in line with the rapidly increasing amount of information online. Its search functionality was also much simpler for users. 
Users only needed to type their search request in a search bar, rather than look through categories for it.<\/span><\/p>\n Although AltaVista and Yahoo! differed in how they organized and presented results, they did share one concept: the idea that search should be intertwined with corporate interests. Both sites, and many others that came after, presented themselves as \u2018portals\u2019 through which users could access not only their search results, but also a myriad of other media and services (Feuze, Fuller and Stalder 2011), plenty of which were provided through third-party corporate entities. Acting as \u2018portals,\u2019 search sites failed to take advantage of a key concept in the distributed nature of the WWW: sharing. Distributed sharing was the catalyst of the internet\/WWW and, as I will argue later, a search site can work much better if it takes a more direct and social approach to defining what content on the WWW users would like to see. Instead, they pushed human-edited content and search results that were tied to commercial interests (Feuze, Fuller and Stalder 2011).<\/span><\/p>\n <\/p>\n <\/span><\/p>\n Launched in 1998, Google took a different approach to search than any major engine before it. Instead of acting as a web portal, it offered users a clean interface, devoid of anything but the logo, a search box, and two buttons (search, and I\u2019m feeling lucky) (Weinberger, 2012).<\/span><\/p>\n Google\u2019s ranking algorithm returned much more accurate results than those of any other search engine. This accuracy came from Brin and Page\u2019s PageRank algorithm, which carried out an \u201c…objective rating of the importance of websites, considering more than 500 million variables and 2 billion terms\u201d by interpreting hyperlinks between websites as votes cast for one another (Google, 2008).<\/span><\/span><\/p>\n <\/p>\n <\/span><\/span><\/p>\n The democratic underpinnings of PageRank are unquestionable. 
Each node in a network casts votes as hyperlinks, the accumulation of which nominates an \u201celected\u201d authority over information. Concepts of democracy have always been closely tied to theories of the Internet, stemming from the inherent distributed nature of the WWW that allows any individual with a connection to participate.<\/span><\/p>\n Since PageRank uses hyperlinks as votes, and since these links are placed by individual users of the WWW, PageRank embodies what Benkler (2006) calls The Wealth of Networks.<\/span><\/p>\n In The Networked Information Economy, average citizens have the means of production and distribution that were once only available to people working within the confines of the\u00a0<\/span>Industrial Information Economy institutions\/organizations. As a result, the spread of information and culture is no longer primarily shaped by market-based or government based actors.<\/span><\/p>\n The rise in the non-proprietary use of networks to share culture and information, and the resulting distributed ownership of \u201cmaterial requirements\u201d for producing and sharing takes advantage of the distributed nature of the WWW. The Networked Information Economy has led to \u201ceffective, large-scale cooperative efforts\u201d like Wikipedia, open source software, and peer-to-peer networks. It is these distributed efforts, along with the ability to do so – and their output – that represents the Wealth of Networks.<\/span><\/p>\n <\/p>\n <\/span><\/p>\n The Wealth of Networks is a public space made up of the sum of human knowledge that exists online. As a search engine, Google’s wealth is a product of the connections it creates between pre-existing sources of wealth online. 
It extracts wealth from the WWW, indexes it, ranks it, and then presents it back to the users who initially uploaded the content (Pasquinelli, 2009; Jakobssen and <\/span>Stiernstedt, 2010).<\/span> More precisely, Google identifies the network value produced by the wealth accumulated through the social interactions of nodes in the WWW and then uses it as its own source of wealth. <\/span>It is part of a wider trend in Social Media that “mark[s] a shift to a new economy in which value is not embedded in social relations but in which social relations are a primary source of value” (Stark, 2009, pp. 173).<\/span><\/p>\n Where mass-media broadcasters in the Industrial Information Economy produced their own content and then used the attention they gained from it as a commodity to be sold to advertisers, Google pools content from the WWW and uses the traffic from people seeking that content as a commodity to sell to advertisers. It recaptures information that comes from the distributed structure of the internet to sell as a commodity.<\/span><\/p>\n It does so with the help of the Google AdWords and AdSense services. Google, having dominated the search market and consequently having millions of people view its homepage per day, was in the perfect position to revamp the advertising industry for the online environment. Being such a large site with an estimated 900 thousand servers (Koomey, 2011) also meant that it needed to monetize[1].<\/span><\/p>\n Google\u2019s idea with AdWords was to make advertising less intrusive and more accurate. It did so by making the sponsored results look very similar to the organic results, distinguished only by appearing to the right of and above the organic results, and by being discreetly labeled \u201csponsored ad\u201d (Levy, 2011, pp. 91). The system made sure that only advertisements relevant to the particular search query appeared.<\/span><\/p>\n AdWords harnesses the attention given to search results pages. 
It has immense power, considering that Google is the most visited website in the world (Alexa, 2012). AdSense differs in that it allows anyone with a website to display behavioural advertising. In essence, behavioural advertising or \u201ctargeting\u201d works by allowing marketers to track what sites internet users visit by dropping cookies on their computer. Once tracked, users are sorted into pre-defined categories of interest, to which sites can then target specific ads (Long, 2007). <\/span><\/span>Compared to traditional \u201cbanner advertising,\u201d where the contextual relevance of an ad was mostly guesswork, AdSense ads are more likely to appeal to the visitor, who is then more likely to actually click through the ad.<\/span><\/p>\n AdSense works by dropping a cookie onto a user\u2019s hard drive when they visit a page that contains an AdSense ad. This cookie then provides Google with a wealth of valuable information. This information is combined with a user\u2019s search results (from the Google search page) to form a comprehensive log of information on said user (Levy, 2011, pp. 335). This information is not only used to display even more specific behavioural advertising to the user, but is also used in a recursive manner to constantly improve Google\u2019s search results and behavioural advertising[2].<\/span><\/p>\n Another way of framing Google\u2019s monetary monopoly is to consider the concept of a natural monopoly. Pollock (2009) argues that the \u201c…very high fixed costs…combined with very low marginal costs,\u201d of Google represent characteristics of a natural utilities monopoly, much akin to electricity. 
A natural monopoly differs from a monopoly on the wealth of networks because it is simply a natural outcome of a large-scale service and not a proprietary model built on top of organizing others\u2019 data.<\/span><\/p>\n <\/p>\n <\/span><\/p>\n It\u2019s evident that Google’s desire to give the world access to any knowledge at the click of a button is buttressed by their ability to maximize profits by building ever more extensive logs about individual users.<\/span><\/p>\n These logs are a form of \u201cdataveillance,\u201d which gives Google a wealth of information that \u201callows analysis to inductively construct the audience for sale\u201d (Shaker, 2006). Consequently, Google is able to take a confusing mess of information and transform it into accessible categories for advertisers. This act is representative of a shift away from the \u201cmonolithic structures of state-surveillance\u201d towards a more dispersed organization of surveillance, typified by corporate entities (Haggerty and Ericson, 2000). Whereas methods and tools of surveillance were once only available to the state, global corporate entities now have them at their disposal as well, and are using them to monitor citizens for monetary purposes (R<\/span>\u00f6<\/span>hle, 2007).<\/span><\/p>\n Google states that the main reason they collect and hold user data is to enhance ranking algorithms (Varian, 2008). However, as Hoofnagle (2009) points out, they also have strong incentives to collect a wealth of data to expand their advertising-based business model. 
In this regard, \u201cInnovation is raised as a privacy tradeoff in the context of data retention\u201d (Hoofnagle, 2009).<\/span><\/p>\n It is this contestation between innovation and privacy that exemplifies Google\u2019s ethos in privacy matters; they must not only provide the user with the best results possible, but also make sure not to violate their privacy by doing so[3].<\/span><\/p>\n It is undeniable that wherever Google stores its user logs is of great interest to third-party sources, perhaps even ones with criminal intent. Such was the case when hackers, later traced back to locations within China, gained access to sensitive information in Google\u2019s servers. The hackers stole both valuable source code and access to the Gmail accounts of Chinese political dissidents and human rights activists (Levy, 2011, pp. 269)[4].<\/span><\/p>\n \n User logs located at \u201ccentral[5]\u201d locations are also prone to censorship and filtering practices, which consequently affects the quality of information users receive from the search engines themselves. These practices come from four sources: the technology that surrounds search, the Governments within which search servers reside, the monetary intent of search engines, and the mass-media economy that fights back against modern search. In this next section I will outline these four censorship and filtering practices, explain why they affect the sort of information users receive, and also delve into how advertising and increasing personalization further the problem.<\/span><\/span><\/p>\n \u00a0<\/p>\n \u00a0Halavais (2008, pp. 87) points to a variety of studies suggesting that searchers \u201csatisfice\u201d when looking for information. They won\u2019t seek out the best answer, rather one that is simply good enough given the small amount of time they are willing to spend. In other words they sacrifice knowing more because they are satisfied \u2013 they satisfice. 
Guan and Cutrell (2007) discovered – through eye-tracking studies – that users focus much more on the top results of a search page than on the ones further down. Taking these points into account we can say that a large portion of people who use Google are looking to satisfy \u201cbasic information needs\u201d and are willing to settle for the easiest answers, which – given Google\u2019s accuracy – are most likely at the top of the search results. It follows that whatever information is first presented to users at the top of a Google result page is of great importance, for that is the information they are most likely to consume.<\/span><\/p>\n Google\u2019s results pages follow a Power Law pattern where a select few websites dominate the top results. Hindman et al. (2003) call this phenomenon a \u201cGooglearchy.\u201d A Power Law structure is typified by a network in which \u201c…most nodes will be relatively poorly connected, while a select minority of hubs will be highly connected\u201d (Watts, 2003, pp. 107). This network structure is further reinforced by the aforementioned satisficing, in which searchers are content with only looking at the first few Google search results.<\/span><\/p>\n Rogers (2009, pp. 176) showed, drawing on a search query for terrorism, how the top Google results were often self-referential. The top-ranked site is often Wikipedia, which has cited the top news sites, as they already had journalistic authority. Thus, in a recursive manner, both Wikipedia and the news sites act as link farms, promoting each other to the top through PageRank link value. Metahaven (2009, pp. 189) also shows how Google\u2019s list of results, specifically the top 10, \u201c…harnesses a preference for sources, many of which have become authoritative for their social structure.\u201d<\/span><\/p>\n In another study, Hindman et al. 
(2003) crawled three million pages, organized the indexes in a manner similar to PageRank (links as votes), and then analyzed the link structure around controversial topics like abortion and gun control. They found that only a select few sites accrued most of the incoming links. These would be the top-viewed sites on search engine results pages.<\/span><\/p>\n The scientific-reference-based voting structure of the PageRank algorithm constitutes a technological filtering that leads to the suppression of an estimated 80% of the information on the WWW (Ratzan, 2006). This 80% constitutes much of the \u201cdeep web.\u201d This filtering is a \u201c…form of power both more sneaky and more structural than old-fashioned coercion\u201d, which \u201c…suppresses alternatives without coercion being needed\u201d (Metahaven, 2009, pp. 186). Indeed, vote-based hyperlinking is a system of control without a face, one which is difficult to define unless one is technologically literate, but one that humans have themselves created, and must therefore adhere to. PageRank is a prime example of Lawrence Lessig\u2019s (2006) famous dictum \u201ccode is law,\u201d where hardware and software define what we can and cannot do. Like laws in the real world, \u201c[c]ode is never found; it is only ever made, and only ever made by us<\/span><\/span>\u201d (Lessig, 2006).<\/span><\/p>\n <\/p>\n <\/span><\/p>\n Government pressure on Google in Europe and North America – aside from corporate pressure, which I will address later – has minimal impact on the daily lives of most citizens. At present, these are the countries that, for the most part, aren\u2019t overtly affected by having a centralized server search engine like Google. The citizens living under authoritarian regimes are the ones who are affected by Government search censorship. 
Since China is the largest search market in the world (Internet World Stats, 2012) and because of Google\u2019s well documented tension with their government, I will use Chinese Government censorship as a case study to exemplify the problems citizens face when attempting search in a hostile environment.<\/span><\/p>\n China first uses a \u201cporous network of internet routers\u201d that filter blacklisted keywords. This is more widely known as the \u201cGreat Firewall of China.\u201d Like many other governments around the world, they use basic filtering techniques such as Domain Name System (DNS) tampering and Internet Protocol (IP) blocking in their firewall servers (OpenNet Initiative, 2009a<\/span>). Unique to Chinese censorship is the practice of TCP reset filtering. Routers in the firewall identify blacklisted keywords that were typed by internet users then break the connection from the user\u2019s intended destination back to the user (OpenNet Initiative, 2009a<\/span>).<\/span><\/span><\/p>\n For websites residing outside of the firewall, this means that any term on their site deemed controversial by Chinese authorities will not be let into the country. This greatly affected Chinese search results on Google and is why, in 2006, Google obtained a Chinese business license to launch a Google.cn domain (Human Rights Watch, 2006<\/span>). Residing within the firewall meant that search queries didn\u2019t have to pass through the entirety of the firewall. Unfortunately, it also meant that Google.cn would have to \u201c…police their own content under the penalty of fines, shutdown and criminal liability\u201d (OpenNet Initiative, 2009b<\/span>). <\/span><\/span><\/p>\n China forces all Internet Content Providers (ICPs) to obtain a license before they can legally provide access to online content. 
A condition of the license is that the ICP must \u201c…prevent the appearance of politically objectionable content through automated means, or to police content being uploaded by users for unacceptable material\u201d (Human Rights Watch, 2006<\/span>). To do so, companies employ people to create \u201cblock-lists\u201d of what they expect the government to find objectionable. This type of \u201cintermediary liability\u201d is referred to by MacKinnon (2012, pp. 241) as \u201cnetworked authoritarianism.\u201d<\/span><\/span><\/p>\n More ominous than filtering through the \u201cGreat Firewall\u201d and ICP liability is the widespread implementation of the Green Dam software in China. It<\/span> blocks \u201caccess to a wide range of web sites based on keywords and image processing, including porn, gaming, gay content, religious sites and political themes\u201d (OpenNet Initiative, 2009a<\/span>). It also has the ability to monitor computer behaviour on a personal PC level; it can view keystrokes done in software like Microsoft Word and then terminate the application if it detects blacklisted words (OpenNet Initiative, 2009a<\/span>).<\/span><\/span><\/p>\n In essence, the Green Dam software acts in a similar manner to P2P distributed networks in that it functions at a computer\u2019s local level to filter sensitive material. Where distributed computing is often touted as a democratizing technology, China realized its potential through the lens of censorship. <\/span><\/span><\/p>\n Fortunately, the use of the Green Dam software in personal PC\u2019s was never fully realized. The Minister of Industry and Information Technology announced that the aforementioned mandatory installation of the software would not be put into legal effect (<\/span>OpenNet Initiative, 2009a<\/span>). However, public computers such as those in Schools and Internet Cafes would still be required to have the software installed (<\/span>OpenNet Initiative, 2009a<\/span>). 
This is troubling considering that 42 percent of Chinese computer users access the internet from cafes (China Internet Network Information Center, 2009).<\/span><\/span><\/span><\/p>\n <\/p>\n <\/span><\/p>\n As exemplified in the previous section, it is very difficult for average Chinese citizens to retrieve search results that aren\u2019t heavily influenced by Government control. The Chinese government has tight control over what information goes in and out of the country. Unlike Government censorship, which generally only affects what residents of a particular country can see, economic censorship has global reach – most notably through the Digital Millennium Copyright Act (DMCA) provisions aimed at combating illegal file-sharing, which have turned search engines into an \u201cinstrument of international power\u201d (Halavais, 2008, pp. 129). DMCA takedown notices are the number one reason for content removal on Google, with 97% of the 3.3 million requests in 2011 being complied with (Rushe, 2012). <\/span><\/span><\/p>\n As mentioned previously, Google nearly always takes content down when a valid court order is issued. They\u2019ve even gone so far as to \u201cdowngrade websites that persistently breach copyright laws\u201d (Google, 2012b). Search results that link to files are taken down globally. When one search engine, such as Google, dominates the market by such a large margin, the fact that one country\u2019s legal requests can affect what citizens in other countries access is cause for concern. This is not to say that search engines are made for illegally sharing content; my point is to highlight the fact that corporate entities can have plenty of power over information in the Networked Information Economy, even when they no longer own the means of production.<\/span><\/span><\/p>\n <\/p>\n <\/span><\/span><\/p>\n One of Google\u2019s mantras is \u201cto give you exactly the information you want right when you want it\u201d (Google, 2008). 
They drastically improved searching online, but not everyone searches for the same results – people using the same query might be looking for different information. To tackle this problem, Google began to \u201c…personalize search in order to deliver more relevant results to the users\u201d (Feuze, Fuller and Stalder, 2011). Now, \u201c…results are tailored to one\u2019s tastes, based on search history and results clicked\u201d (Rogers, 2009, pp. 180). <\/span><\/span><\/p>\n An upside to personalization is that it diminishes the effect that the power-law structure of PageRank has on results. By factoring in past searches and user interests, the search results are less oriented towards the most popular results on the web and more oriented towards outlying sites that fit well with a user\u2019s taste profile. Personalization relieves Google of accountability for results returned. Users have partly themselves to blame for what their search query returns. <\/span><\/span><\/p>\n The downside to personalization is that it diminishes autonomy in search, constituting \u201c…an obscure iron numeric cage that constrains users\u2019 freedom and their capacities of determination\u201d (Lobet-Maris, 2009, pp. 81). Feuze et al. (2011) showed how, using three newly created Google accounts populated with a combined 195 812 individual search queries, personalized search results developed over a period of time. They found that over only a small amount of time the personalized results were glaringly different between the three accounts. 
They also found that:<\/span><\/span><\/p>\n \u201c…Google is actively matching people to groups, which are produced statistically, thus giving people not only the results they want (based in what Google knows about them for a fact), but also generates results that Google thinks might be good to users (or advertisers) thus more or less subtly pushing users to see the world according to criteria pre\u2013defined by Google.\u201d<\/i><\/span><\/span><\/p>\n As R\u00f6<\/span>hle (2007) points out, such personalization \u201c…implies an expansion of surveillance in the interests of commercial actors.\u201d In one of the first papers Sergey Brin and Larry Page wrote about Google, they deplored mixing advertising with search, stating that the \u201c…issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm\u201d (Brin and Page, 1998). Contrary to this, the integration of personalization as a key ingredient in Google search is partly driven by commercial incentives, indicating that Google\u2019s search results are constrained by monetary considerations. <\/span><\/span><\/p>\n <\/p>\n <\/span><\/p>\n As MacKinnon (2012, pp. 498-499) advocates, the best \u201ccounterweight\u201d to corporate power on the internet is a strong digital commons. \u201cA robust digital commons is vital to ensure that the power of citizens on the Internet is not ultimately overcome by the power of corporations and governments\u201d (MacKinnon, 2012, pp. 508-513). Being gatekeepers of information, search engines form an integral part of the digital commons. They are an opportunity to realize the original democratic potential of the internet: a system that allows anyone to share information freely from node to node, without coercion in between nodes. 
As I will show in section 3.2.2, the best way of achieving this via search is to distribute it with a robust P2P network.<\/span><\/span><\/p>\n From the perspective of fostering a stronger digital commons, the most important components of a search engine are the ones that give users greater autonomy in choosing what they search, how they search, and what results they find, free of control from certain technological, government, and corporate constraints (R\u00f6hle, 2009, pp. 129). The five requirements for a digital-commons-based, democratic search engine are that it be free from:<\/span><\/span><\/p>\n <\/p>\n <\/span><\/p>\n There are many other search engines trying to solve these problems in one way or another. Although I am focusing on Google, a brief outline of these other sites is an important segue into P2P search. Google dominates the search environment, so to compete with it most other search engines either try to copy it or capitalize on a niche market that Google has missed. These niches often involve attempts to resolve some of the issues I have outlined. For example, Yippy divides results into thematic clusters in the hope that users can find deeper associations between topics, thus relieving some of the deep web suppression that Google\u2019s algorithms cause. Yippy also doesn\u2019t track users, thus safeguarding their privacy. <\/span><\/span><\/p>\n Although not the main premise of Yippy, its function represents an opposition to Google\u2019s way of suppressing the deep web. This fits well with the ideal that a more democratic search engine should make it so that users are \u201c…able to grasp where the borders between…social currents persist, and where they consent or diverge\u201d (Metahaven, 2009, pp. 196). DuckDuckGo takes a different approach by focusing on popping the user\u2019s Google-induced \u201cfilter bubble\u201d and ensuring complete privacy in the process – it doesn\u2019t create user logs (DuckDuckGo, 2012). 
Unfortunately, they are both still conducive to government and global economic censorship[6].<\/span><\/span><\/p>\n There are also \u201csocial\u201d search engines that return queries based off of collaborative filtering. Swicki is a mix between a wiki and a search engine that relies on users to build search databases around certain subject areas (R\u00d6<\/span>hle, 2007). The Open Project is a web directory edited by users, somewhat similar to Yahoo!\u2019s original premise. With more than 60 000 editors it is comprised of more than 4.5 million websites (R\u00d6<\/span>hle, 2007). Social search sites address the issue of technological filtering but fail to address any of the other issues of censorship. In fact, they appear to be more interested in further personalizing search, which is good for personal autonomy and democracy, but only within small clusters. [7]<\/span><\/span><\/p>\n <\/p>\n <\/span><\/p>\n In this final section I will argue that Peer-to-Peer (P2P) search is currently the only method of search capable of resolving the five issues I have outlined[8]. I will then use the P2P search engine YaCy as a case study to test my claims. The analysis will be qualitative and observational. I will use the small amount of documentary evidence available \u2013 both from YaCy itself and its users – and my own observations to analyze YaCy in relation to Google search. But first we must understand the fundamental characteristics of P2P. <\/span><\/span><\/p>\n A P2P network is a collectively produced structure of information that is accomplished through the formation of a \u201cchain of interconnected applications” by individual users (known as peers or nodes), who share both resources and personal computing power (Rigi, 2012; Loban, 2004). A P2P network does not rely on centralized network servers like in a traditional web-client\/server system (Svensson and Bannister, 2004). 
This \u201cdecentralization\u201d means that nodes in a P2P network share information only with other nodes, the result of which is a network without a centre. <\/span><\/span><\/p>\n The wealth produced by a P2P network, through a distribution of labour (Rigi, 2012), cannot be capitalized on for monetary purposes. The immaterial labour essentially propagates itself, leading to a network that is more democratic in nature, and closer in structure to the early days of the Internet, than traditional central server networks such as the WWW. The potential power of distributed computing networks like P2P is vastly superior to that of other models, but it depends on how many people decide to participate. For example, SETI@home is a distributed computing initiative that uses people\u2019s spare computing power to analyze signals from space, in search of extra-terrestrial communications (SETI@Home, 2012). It holds the Guinness world record for the largest data computation in history (Newport, 2005). If distributed power of SETI@home\u2019s magnitude could be used for web search, not only would it alleviate the monetary constraints of maintaining large server farms, but it would also provide more computing power. <\/span><\/span><\/p>\n There are currently three different distributed P2P approaches to web search. FAROO is the largest of the three with more than 2.5 million peers (FAROO, 2008). Users install the software on their computer. Once the software is active, every website they visit is logged in an index. The index is then used to automatically compile and adjust, in real time, search rankings across the FAROO network. Because it ranks pages automatically when a user visits a website, it is simple and democratic. Unfortunately, FAROO is relatively new, so there is a lack of in-depth understanding of its inner workings and their implications. 
According to its website, FAROO retains no search logs and is immune to censorship because of encrypted search queries and indexes. However, the site also mentions that FAROO uses \u201cprivacy protected behavioural advertising,\u201d which renders it proprietary software. Advertising in FAROO may not affect search results as it does in Google, but there is still surveillance in the pursuit of monetary gain, capitalizing once again on the wealth of networks, this time the wealth being that of its 2.5 million users. The lack of server costs associated with P2P networks negates the need for advertising revenue beyond what needs to be paid to the developers for coding and upkeep. There is no doubt of the democratic potential and innovative search mechanisms of FAROO, but its inclusion of advertising is a point of contention. It is important to mention that FAROO’s proprietary P2P network is not considered \u201cfree software,\u201d unlike most other platforms built on P2P networks. Though P2P software and Free Software often display similar attributes, they are not always interchangeable, as the case of FAROO shows. <\/span><\/span><\/p>\n Seeks is another P2P web search application that can be installed on a personal computer. Like FAROO, it builds a social search network around users\u2019 online activity (Seeks, 2011). Based on search queries through other search engines, which results are clicked, and web search, Seeks builds a personal profile that is stored on the user\u2019s personal computer. This profile is used to filter search results. Personal profiles are placed into groups according to shared interests, which further filters results. Within these groups users can filter even further in a collaborative manner and also comment on queries. Users can also link their own website or a company to any group or interest so that other users with shared interests are more likely to find that particular site (Seeks, 2011). 
Taking all of this into account, Seeks is more of a P2P social network built around searching than a distributed search engine. It offers plenty of autonomy for the individual user but scrapes its results from other search engines, meaning that government and global economic censorship of Seeks\u2019 results would be the same as censorship of central-server search engines.<\/span><\/span><\/p>\n <\/p>\n <\/b><\/span><\/span><\/p>\n Out of the three P2P search applications, YaCy most closely resembles a traditional search engine. Once it is installed on a user\u2019s computer, the user gains the tools necessary to participate in every aspect of a traditional search engine. Running in private mode, a user can use the built-in crawler to crawl the websites of their choice, and can set how regularly they are recrawled so that indexes are always up-to-date. Once the sites are crawled, the resulting indexes are stored on the user\u2019s personal computer. The user can then search their local index, which comprises only the sites they themselves have crawled (YaCy, 2012a). Given the ability to adjust the weighting of a wide variety of ranking parameters, YaCy can essentially act as a personal search engine, with nearly full customization.<\/span><\/span><\/p>\n In public mode, the user\u2019s crawl index is shared with every other peer acting in public mode, a network fittingly called the \u201cfreeworld.\u201d Unlike Google, which stores its search indexes in central servers, YaCy stores different small bits of its index on every single user\u2019s hard drive. \u201cThere is no central body of control in YaCy\u201d (YaCy, 2012b). Indexes are encrypted with a key and then placed into a Distributed Hash Table (DHT), which shares information with other nodes in the \u201cfreeworld.\u201d \u201cThis allows index data to reach the peer before a query for that information is even submitted\u201d (YaCy, 2012b), meaning that every peer has a miniature version of the entire index on their hard drive. 
When information is missing, the peer will \u201ccall out\u201d to other peers in the network for it. These queries and requests traveling through the freeworld are all encrypted to safeguard user privacy (YaCy, 2012c). <\/span><\/span><\/p>\n <\/p>\n <\/b><\/span><\/span><\/p>\n 1. Technological filtering induced by PageRank-like algorithms<\/span><\/p>\n Although, like many other search engines, YaCy attempts to emulate the PageRank algorithm, it doesn\u2019t put as much weight on the PageRank score as it does on more traditional types of index ranking. The abundance of user settings gives users more autonomy to add their choice of weighting structure to their personal index[9]. \u201c[E]veryone can assess the quality and importance of web pages by their own rules and adjust to their personal relevance as a ranking method…\u201d (YaCy, 2012c). <\/span><\/span><\/p>\n Because YaCy builds on more traditional indexing methods, and because its global index is a compilation of only what users have crawled, its results contain a wider variety of content than Google\u2019s organizational-web-dominated results. Because of this, not only will the Power Law structure of PageRank results be less prominent, but there will also be an attenuation of deep web suppression[10].<\/span><\/span><\/p>\n 2. Government censorship<\/span><\/p>\n Unfortunately, as much as YaCy claims to be a fully distributed P2P network, its multi-agent structure, using DHTs, means it must rely on four predefined servers to coordinate node lists, much as torrent files rely on trackers for downloading (<\/span>Rudomilov and Jelenik, 2011<\/span>). This is not to take anything away from YaCy, as no global network is fully distributed, but it does mean that it isn\u2019t fully immune to government censorship. However, shutting down the server lists would be difficult, as they are likely stored in different locations to avoid such risks. 
<\/span><\/span><\/span><\/p>\n Also, since there is no central control and no storage of the index itself in central servers, there is no one place from which to censor particular results. To censor content, a government would need to push other countries to help it censor the four node lists or go after thousands of active peers. \u201cYaCy results can not be censored as no single central authority is responsible for them and there are thousands of servers (personal computers) in multiple countries providing results\u201d (YaCy, 2012a). <\/span><\/span><\/p>\n The most comprehensive way for a government to censor YaCy is to use distributed methods such as the Green Dam initiative. If the Chinese authorities had been successful in deploying it across the country, it would have cut off access to the benefits of YaCy on a local level. For example, a user typing Falun Gong into the YaCy search box would be blocked by a keystroke censor, thus destroying the search.<\/span><\/span><\/p>\n 3. Excessive economic censorship by copyright holders<\/span><\/p>\n Just as with government censorship, corporations have no central servers to approach with regard to removing content. In fact, once content is placed in YaCy\u2019s DHT, it is out of a corporation\u2019s control indefinitely, as pieces of the freeworld index are stored on every node\u2019s computer. Where Google removes the links to copyrighted material in its search results if issued a DMCA take-down notice, YaCy, because there is no central server, is still able to make such links visible. There is no one node to target.<\/span><\/span><\/p>\n That being said, such files are usually located on locker-box type music sites, video streaming sites, or torrent search engines\/trackers. YaCy is simply a search engine that would crawl these sites. In this respect, it can only bypass economic censorship insofar as the sites of the original upload can. 
However, because it can scour deeper parts of the web, it is likely that it can find more obscure file uploads not targeted by copyright holders.<\/span><\/span><\/p>\n 4. The filter bubble induced by personalization, behavioural advertising, and monetary incentives<\/span><\/p>\n Like Google, YaCy presents its users with a filter bubble. Yet the abundance of adjustable ranking parameters and weighting in YaCy allows the user much more autonomy in how their filter bubble is defined, and conversely how the filter bubble defines them. <\/span><\/span><\/p>\n Once again, because YaCy doesn\u2019t rely on central servers and has a small development team, it doesn\u2019t need to make large amounts of money to keep servers running. It\u2019s also funded by the Free Software Foundation Europe (FSFE). By relying on users\u2019 spare processing power, it saves the vast sums of money that Google must pay to run its own servers. YaCy doesn\u2019t need to advertise, which means that no search results are ever returned within the constraints of behavioural advertising, where users are grouped according to interests. Much more of the search content is \u201c…determined by the users, not by commercial aspects of the Web portal operator\u201d (YaCy, 2012c). It also means that \u201csearch requests are never stored, monitored or evaluated for commercial purposes\u201d (YaCy, 2012c).<\/span><\/span><\/p>\nIntroduction<\/span><\/h2>\n
Search<\/span><\/h2>\n<\/p>\n
History of the search engine<\/h3>\n<\/p>\n
How Google won the search environment<\/h3>\n
Wealth of Networks<\/h3>\n
How Google monopolized the Wealth of Networks<\/h3>\n
Why a monopoly of the Wealth of Networks is problematic<\/span><\/h2>\n
Prone to censorship<\/h3>\n<\/p>\n
Technological filtering<\/h4>\n<\/p>\n
Government filtering<\/h4>\n
Economic filtering<\/h4>\n
Advertising and personalization<\/h3>\n
Alternatives to Google search<\/span><\/h2>\n
\n
Social search attempts. Other search engines<\/h3>\n
P2P Search<\/h3>\n
YaCy<\/h4>\n
How YaCy solves search problems<\/h4>\n