The Truth About Stats And Dogs (Or Why Most Surveys Are Wrong)

All statistics are hearsay, but some are reliable hearsay.

By Erik J. Heels

First published 5/1/1997; Law Practice Management magazine, “nothing.but.net” column; publisher: American Bar Association.

What system do you use for picking lottery numbers? Birthdays? Anniversaries? Shoe sizes? Here’s my system. I always pick the previous day’s winning numbers. If you’re thinking “Hmm, that’s dumb!” you’re right. But it’s no dumber than any other system for picking lottery numbers. My six numbers are just as likely to come up as your six numbers. “But but,” some may ask, “what are the odds that the same numbers will come up two days in a row?” That’s not what I’m betting on. I’m betting that my six numbers will come up once. The events that determine the winning numbers on Tuesday and on Wednesday are independent. The outcome of Tuesday’s lottery does not affect the outcome of Wednesday’s. It’s like flipping a coin. If you get heads after one flip, then the odds of getting heads on the second flip are exactly the same as before: 50/50. The idea that this is not so is called the Gambler’s Fallacy. Unfortunately, basic misunderstandings about probability, statistics, and surveys are creeping and crawling onto the Web. The result is that many reports about the growth of the Internet miss the mark.
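The independence claim above can be checked with a quick simulation. The sketch below is only an illustration: the function name, the fixed seed, and the number of flips are all assumptions chosen for the example. It flips a simulated fair coin many times and looks only at the flips that came right after heads.

```python
import random

def heads_rate_after_heads(flips, seed=0):
    # Simulate a fair coin; True means heads. The seed is arbitrary and
    # is fixed only so that the run is repeatable.
    rng = random.Random(seed)
    tosses = [rng.random() < 0.5 for _ in range(flips)]
    # Keep only the flips that immediately follow a heads.
    after_heads = [second for first, second in zip(tosses, tosses[1:]) if first]
    return sum(after_heads) / len(after_heads)

rate = heads_rate_after_heads(100_000)
print(f"Share of heads right after heads: {rate:.3f}")
```

If the Gambler's Fallacy were true, the printed share would sit well below one half. With independent flips it should land very close to 0.5.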

Random Samples

Rule number one. Surveys must be based on samples that are random. If a survey is based on a random sample, then its results can be generalized to the population from which the sample was drawn. If it is not, they cannot. Consider the following example. On the weekend during which a movie theater is running a science fiction marathon, the movie theater’s management decides to survey its audience members about what kinds of movies the theater should show in the future. Lo and behold, the survey reveals that 95% of the audience members want to see more science fiction movies. So the theater holds another science fiction marathon the following weekend and conducts another survey, but this time 97% want more science fiction. After a few more weekends and a few more surveys, the theater is showing 100% science fiction, but (oddly enough) the audience size has been steadily decreasing. Eventually, the theater goes out of business. Why? Because each survey was less accurate than the previous one, each sample less random. And it turns out that there are only a few science fiction fanatics who want to attend science fiction marathons each and every weekend.

The above example may seem obvious. But others are not so obvious. Consider the city transportation authority that surveys its subway riders to determine how frequently people use the subway. The not so obvious problem with this survey is that if half of the city’s population uses the subway about once per week, and if the other half uses the subway five times per week, any survey of subway riders will always contain more of the second group of riders than of the first group. In order to have a random sample, the city must survey people at a location other than the subway.
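The size of the subway bias can be worked out directly. The sketch below assumes that a survey handed out on the subway reaches people in proportion to how often they ride. That assumption, the function name, and the group labels are illustrative and do not come from any real transit survey.

```python
from fractions import Fraction

def rider_sample_shares(groups):
    # groups maps a label to (share of the population, rides per week).
    # Assumption: the chance of being surveyed on the subway is
    # proportional to the number of rides a person takes.
    weights = {name: Fraction(share) * rides
               for name, (share, rides) in groups.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

city = {
    "once a week": (Fraction(1, 2), 1),
    "five times a week": (Fraction(1, 2), 5),
}
shares = rider_sample_shares(city)
for name, share in shares.items():
    print(f"{name}: {float(share):.1%} of those surveyed")
```

Under that assumption the frequent riders are half of the city but five sixths of the people surveyed, so the survey overstates how often a typical resident rides.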

Now consider surveys about Internet usage. If a survey about Net usage is conducted via the Net, then the conclusions drawn about Net usage will show that usage is higher than it actually is. A related problem arises when surveys rely on a selected sample of users sending in their responses. Even if you have a random sample to begin with, you can destroy the reliability of a survey by relying on the nonrandom self-selected few who replied to your survey. If only 5% of survey recipients respond to a given survey, the survey will be skewed to the extremes, because people with strong opinions are more likely to respond to surveys.

Imagine if a survey were mailed to 10,000 households asking one simple question: Do you like Microsoft, yes or no? About 9,900 people would probably throw the survey in the circular file, 50 (the Microsoft employees) would respond yes, and 50 (the Netscape employees) would respond no. The surveyors could (wrongly) conclude that 50% of the households love Microsoft and 50% hate it, when the truth is that 99% have not (yet) expressed an opinion.
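The arithmetic behind this example is short enough to write out. The sketch below uses the figures from the example (10,000 surveys mailed, 50 yes, 50 no). The function name and the labels are made up for illustration.

```python
def summarize_mail_survey(mailed, yes, no):
    # Compare what the respondents say with what is known about everyone
    # who was sent the survey.
    responded = yes + no
    return {
        "response rate": responded / mailed,
        "yes among respondents": yes / responded,
        "yes among all households": yes / mailed,
        "no opinion recorded": (mailed - responded) / mailed,
    }

summary = summarize_mail_survey(mailed=10_000, yes=50, no=50)
for label, value in summary.items():
    print(f"{label}: {value:.1%}")
```

The 50% figure describes only the 1% who answered. Nothing in the returned surveys says how the remaining 99% feel.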

Know When to Say When

Rule Number Two. It does not take a large sample size to get good results. But the lower your response rate, the less random your sample, and the less reliable your results. My statistics professor at MIT did survey work for the American Heart Association. The results of his surveys were used to determine how research dollars were spent. He did not track a large population – only 400 people – but he tracked all of them. Religiously. If one moved, my professor (or, more likely, one of his students) would find him. When you’re tracking heart disease, you need to know if that one person died of a heart attack or moved. Why? Because the formula for determining the margin of error with 95% confidence for a survey is simple: the inverse of the square root of the number surveyed. Stick with me here. If 100 people are surveyed, then the margin of error is 10%. The square root of 100 is 10, and the inverse of 10 is 1/10 or 10%. Survey 400 people, 5% error. Survey 2500 people, 2% error. Survey 10,000 people, 1% error. Given a random sample, you can always conclude with 95% confidence that the margin of error will be 10% for a survey of 100 people. (If you’re curious about this, the 95% confidence rule and the simple formula for margin of error result from application of the Central Limit Theorem, which you can read about in any statistics textbook.)
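The rule of thumb is easy to tabulate. The sketch below simply evaluates one over the square root of the sample size for the four sample sizes mentioned above. The function name is an assumption for illustration. The shortcut is a slightly rounded-up version of the usual worst-case formula for a proportion, 1.96 × sqrt(0.25 / n), which works out to about 0.98 / sqrt(n).

```python
import math

def margin_of_error(sample_size):
    # Rule of thumb from the text: 1 / sqrt(n) at roughly 95% confidence.
    # It only applies to a random sample in which everyone responds.
    return 1 / math.sqrt(sample_size)

for n in (100, 400, 2500, 10_000):
    print(f"{n:>6} surveyed: about {margin_of_error(n):.0%} margin of error")
```

Note how slowly the error shrinks. Going from 10% to 1% takes one hundred times as many respondents, which is why a small sample that is tracked completely can beat a large sample with a poor response rate.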

For example, using this simple formula, Boardwatch Magazine concluded that seven percent of all Web sites turned their backgrounds black in February 1996 when the Communications Decency Act was passed (http://www.boardwatch.com/mag/96/APR/bwm31.htm).

There have been many surveys about the use of the Internet in general and, lately, by lawyers. Unfortunately, nearly all of these surveys were flawed, either because the sample was not random or because the random sample was rendered useless by a low response rate. So what do we know about the growth and use of the Internet?

Nothing But Stats

I believe that the Internet is growing, not exponentially but linearly. And I believe that Internet use by lawyers and non-lawyers is increasing. Finally, I believe that reliable sources of information exist today from which simple conclusions can be drawn about use and growth of the Net. Here are some of them.

Two times per year, Network Wizards conducts an electronic survey about how many computers are connected to the Internet (http://www.nw.com/zone/WWW/report.html). Although this survey (like all surveys) is imperfect, Network Wizards has been doing this for over four years, so it is possible to see trends. From January 1994 to January 1995, the number of host computers on the Net increased by 54%. From January 1995 to January 1996 the increase was 48%. From January 1996 to January 1997 the increase was 41%. We’re up to 16 million (or so) host computers on the Net, but the trends identified by Network Wizards suggest that growth is leveling off.

From April 1994 to April 1996, WebCrawler counted how many Web servers were in its database (http://webcrawler.com/WebCrawler/Facts/Size.html). In January 1995, WebCrawler counted about 15,000 Web servers. In January 1996, it counted about 75,000 Web servers. Assuming that the percentage of computers on the Net that are acting as Web servers has remained constant over the past year (which is a conservative estimate; the percentage has probably increased, but I want to err on the side of caution), there are probably about 128,000 Web servers on the Net today.

Alta Vista, one of the most powerful and largest Internet search engines, allows you to count how many sites external to a particular site link to that site. In other words, you can count how many sites other than your own link to your site. Using this method, you can determine that there are about 270,000 external links to Yahoo, about 20,000 external links to Cornell’s Legal Information Institute, about 3000 to Law Journal EXTRA!, and about 1500 to FindLaw. What does all of this mean? It means that Yahoo is very popular, averaging about two external links per Web site! It means that the most popular legal Web site – Cornell – is about 1/15th as popular as Yahoo. And it means that if you’ve got more than 1000 other Web sites linking to your site, then you’re doing a very good job of marketing your Web site.

And what about listings of lawyers on the Web? Well, the most comprehensive listing remains Yahoo’s listing of lawyers. As I mentioned last month, that listing is probably about 75% accurate, but there is also a backlog of law firms that are waiting for their Web sites to be added to the index that makes up for that inaccuracy. So the number of lawyers in Yahoo’s directory ends up being a reasonable estimate. As of January 1997, there were about 1700 law firms on the Web.

All statistics are hearsay, but some are reliable hearsay. In an attempt to catalog some of this reliable hearsay in one central place, I have created a “nothing but stats” page on my Web site. This page contains all the statistics mentioned above and should help you draw your own conclusions about the growth and use of the Net.

For Further Reference

There is surprisingly little information about statistics on the Net, but I did find one article from the United Kingdom that discusses survey methodology in detail (http://lispstat.alcd.soton.ac.uk/am306/quant1.txt). One good source of information on statistics is the magazine Chance published by the American Statistical Association (http://www.amstat.org/). An excellent reference is a book by John Allen Paulos entitled “Innumeracy” about “mathematical illiteracy” and its consequences. Like Strunk and White’s “The Elements of Style,” Paulos’s “Innumeracy” should be on the bookshelf of every publisher. And since the Web enables all of us to be publishers, perhaps those two books should be read by everybody! Well, one step at a time.

“Innumeracy: Mathematical Illiteracy and Its Consequences” by John Allen Paulos.
