There have been recent reports about the supposed decline in p2p traffic, and about the difficulties of measuring p2p networks. There is an interesting paper by Stutzbach et al entitled ‘On Unbiased Sampling for Unstructured Peer-to-Peer Networks’ – the paper has some fairly technical sections, but it still gives a good sense of just how difficult this whole area is. For starters, the authors dip into sociology to look at the problems of studying ‘hidden’ populations – populations whose boundaries and size we don’t really know, and whose members prefer to remain anonymous. A classic example is the population of drug users. The authors pick one method to draw on – respondent-driven sampling – which is interesting because the method is itself p2p-like in structure: you start with a small seed of respondents, then get those respondents to identify more respondents, and so on.
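Respondent-driven sampling is, at its core, snowball sampling with a referral limit. Here is a minimal Python sketch of the idea – the contact graph, the `coupons` referral limit, and all the names below are illustrative assumptions of mine, not details from the paper:

```python
import random
from collections import deque

def respondent_driven_sample(referrals, seeds, coupons=3, target=100):
    """Snowball outward from a few seed respondents: each respondent
    'refers' up to `coupons` of their contacts, who are then sampled
    and asked for referrals in turn."""
    sampled = set(seeds)
    queue = deque(seeds)
    while queue and len(sampled) < target:
        person = queue.popleft()
        contacts = [c for c in referrals.get(person, []) if c not in sampled]
        for contact in random.sample(contacts, min(coupons, len(contacts))):
            sampled.add(contact)
            queue.append(contact)
    return sampled

# Hypothetical contact graph: who each person is able to refer.
referrals = {
    "a": ["b", "c"], "b": ["d"], "c": ["d", "e"],
    "d": ["f"], "e": ["f"], "f": [],
}
print(respondent_driven_sample(referrals, ["a"], target=6))
```

The p2p parallel is clear: each "respondent" hands you pointers to its neighbours, exactly as a peer hands a crawler its neighbour list.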
Gathering unbiased sample data from a p2p network is hard, and the current methods carry bias within them. The authors take sampling techniques built around static systems and adapt them to work on a dynamic network. Because p2p networks are very dynamic – a few peers have long sessions while the majority have very short ones – sampling is a problem. The authors illustrate this with an example:
“Suppose we wish to observe the number of files shared by peers. In this example system, half the peers are up all the time and have many files, while the other peers remain for around 1 minute and are immediately replaced by new short-lived peers who have few files. The technique used by most studies would observe the system for a long time and incorrectly conclude that most of the peers in the system have very few files.”
What they are saying is that if you discount the long-standing peers from being re-sampled, because they have already been covered once, then each successive snapshot of the system you sample will contain more and more short-lived peers, and so skew the results. (Though it seems to me all this assumes a population that is large relative to the sample size, as most p2p networks are.) The problem thus suggests the solution: the authors used a method that allows the same peer to be sampled at different points in time – they decoupled the sampling from the session lengths:
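To see the bias concretely, here is a toy simulation of the scenario quoted above – the peer counts and file numbers are made up for illustration. A crawler that observes for a long time and counts each distinct peer once ends up dominated by the churn of short-lived peers:

```python
# Toy model of the paper's example: at any instant half the peers are
# long-lived (many files) and half are short-lived slots whose occupant
# is replaced each minute (few files). A crawler that observes for a
# long window and never revisits a peer sees mostly short-lived peers.
LONG_PEERS = 50        # stable peers, each sharing many files
SHORT_SLOTS = 50       # slots holding a fresh short-lived peer each minute
MINUTES = 60           # length of the observation window
FILES_LONG, FILES_SHORT = 500, 5

seen = {}  # peer id -> file count; each distinct peer counted once
for minute in range(MINUTES):
    for p in range(LONG_PEERS):
        seen[("long", p)] = FILES_LONG
    for s in range(SHORT_SLOTS):
        seen[("short", s, minute)] = FILES_SHORT  # new peer every minute

few = sum(1 for f in seen.values() if f == FILES_SHORT)
print(f"{few}/{len(seen)} sampled peers look like they share few files")
# -> 3000/3050: the naive crawl concludes that almost all peers share
#    few files, even though half the network at any instant shares many.
```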
“[O]ur approach will correctly select long-lived peers half the time and short-lived peers half the time. When the samples are examined, they will show that half of the peers in the system at any given moment have many files while half of the peers have few files, which is exactly correct.”
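A matching toy simulation of that approach – again with made-up numbers of mine, not the paper's – samples at uniformly random instants and allows the same long-lived peer to be selected more than once:

```python
import random

# Sample at random instants in time: pick a random minute, then a
# uniformly random peer among those up at that instant. Long-lived
# peers may recur across samples; short-lived slots hold a different
# peer each minute. The short-lived fraction comes out at ~0.5.
random.seed(1)
LONG_PEERS, SHORT_SLOTS, MINUTES = 50, 50, 60

samples = []
for _ in range(10_000):
    minute = random.randrange(MINUTES)            # a random instant
    idx = random.randrange(LONG_PEERS + SHORT_SLOTS)
    if idx < LONG_PEERS:
        samples.append(("long", idx))             # same peer can recur
    else:
        samples.append(("short", idx, minute))    # current slot occupant

short_frac = sum(s[0] == "short" for s in samples) / len(samples)
print(f"fraction of samples that are short-lived: {short_frac:.2f}")
```

Decoupling the sample from session length is exactly what removes the bias: the probability of being sampled no longer grows with how many distinct peers have occupied a slot.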
That addresses one of the issues with sampling – but how did they adapt the static methods to a dynamic environment? With a clever adaptation: they introduce backtracking into the existing methodology of the Metropolized Random Walk and, not surprisingly, call the result the Metropolized Random Walk with Backtracking. Here’s how it works:
“We make an adaptation by maintaining a stack of visited peers. When the walk chooses a new peer to query, we push the peer’s address on the stack. If the query times out, we pop the address off the stack, and choose a new neighbour of the peer that is now on top of the stack. If all of a peer’s neighbours time out, we re-query that peer to get a fresh list of its neighbours. If the re-query also times out, we pop that peer from the stack as well, and so on. If the stack underflows, we consider the walk a failure. We do not count timed-out peers as a hop for the purposes of measuring the length of the walk.”
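Here is a simplified Python sketch of that walk. The toy graph, the timeout model, and the simplifications are mine, not the paper's – in particular this version skips the dead peer rather than pushing and popping it, and omits the re-query for a fresh neighbour list. The Metropolis acceptance step, taking a move from u to v with probability min(1, degree(u)/degree(v)), is what makes the walk's stationary distribution uniform over peers:

```python
import random

def mrwb(graph, alive, start, walk_length, rng=random):
    """Sketch of a Metropolized Random Walk with Backtracking.
    `graph` maps each peer to its neighbour list; `alive(v)` simulates
    whether a query to peer v succeeds (False = timeout)."""
    stack = [start]
    hops = 0
    while hops < walk_length:
        u = stack[-1]
        neighbours = list(graph[u])
        rng.shuffle(neighbours)                  # query neighbours in random order
        moved = False
        for v in neighbours:
            if not alive(v):                     # query timed out: try another
                continue                         # (timeouts don't count as hops)
            # Metropolis acceptance: move to v with prob deg(u)/deg(v),
            # otherwise stay at u; either way it costs one hop.
            if rng.random() < min(1.0, len(graph[u]) / len(graph[v])):
                stack.append(v)
            hops += 1
            moved = True
            break
        if not moved:
            stack.pop()                          # all neighbours timed out: backtrack
            if not stack:
                return None                      # stack underflow: walk fails
    return stack[-1]                             # the sampled peer

# Toy undirected graph: a 10-node ring with +/-3 chords (all degree 4);
# peers 3 and 7 never respond to queries.
graph = {i: sorted({(i - 1) % 10, (i + 1) % 10,
                    (i + 3) % 10, (i - 3) % 10}) for i in range(10)}
dead = {3, 7}
sample = mrwb(graph, lambda v: v not in dead, start=0, walk_length=25)
print("sampled peer:", sample)
```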
The authors then go on to present a pretty robust analysis of their method in action. All in all, the paper is both an interesting account of the difficulties of getting good data on p2p networks and a set of inventive solutions to the problems therein.
Disclosure note: the authors of the paper cite their research as being supported by the National Science Foundation and Cisco Systems. This article was first published on the Catblog.