(sidd puts soapbox on the ground and climbs on it)
as I, and others, have said ad nauseam
'calculating the number of viewers from webserver logs is like
a radio station trying to measure the number of listeners from
the power broadcast by the antenna'
1 a) that they don't sample all viewers is certainly true...
but if you do good sampling and have a
nice 'smooth' population, good statistics will save you
(but see below)
1 b) they say they are sampling from the most visited sites...
and then they trot out their own numbers to prove it
but I think for the 1 Megahit+ a day class they probably
are sampling at the servers at Exodus and the rest of the colocation
facilities, and may even have a reasonable idea of the traffic
this is because bandwidth in the 1 Mhit/day class is still not widely
available... right now they may have enough sampling sites
at major ISPs where this kinda bandwidth is available
... but large bandwidth is coming and as it approaches their numbers
will get further away from reality ... unless they put a sampling
box on every machine in the world
and we won't talk about the distributed services like
Gnutella, Napster, and Freenet, to which these methods cannot apply
but, they are almost certainly undersampling the sites that get
0.1 Megahits a day .. like us
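to put a rough number on the undersampling worry, here is a minimal sketch. it assumes the ratings people use a simple random panel of users and that each panelist either visits a site or doesn't (a plain binomial model). the panel size and the reach figures are made up for illustration, they are not anybody's real numbers.

```python
import math

def relative_std_error(panel_size, reach):
    """Relative standard error of a panel-based reach estimate.

    Assumes a simple random panel and a binomial model:
    std error of the estimated reach is sqrt(p * (1 - p) / n),
    so relative to p it is sqrt((1 - p) / (n * p)).
    """
    if panel_size <= 0:
        raise ValueError("panel_size must be positive")
    if not 0 < reach <= 1:
        raise ValueError("reach must be in (0, 1]")
    return math.sqrt((1 - reach) / (panel_size * reach))

# hypothetical panel size, not taken from any real ratings firm
PANEL_SIZE = 50_000

# hypothetical reach: fraction of all users who visit the site
SITES = [("big portal", 0.30), ("0.1 Megahit/day site", 0.0005)]

for label, reach in SITES:
    expected_panelists = PANEL_SIZE * reach
    rse = relative_std_error(PANEL_SIZE, reach)
    print(f"{label}: ~{expected_panelists:.0f} panelists, "
          f"relative error ~{rse:.1%}")
```

under these assumptions the big site is seen by thousands of panelists while the small one is seen by a couple of dozen, so the small site's estimate is far noisier even before any question of bias comes up.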
but, I don't think that 1 a) holds .. the internet user population
is not smooth ... it is rather a collection of disparate groups,
like on Usenet, with limited overlapping interests. so perhaps
while their numbers might hold for generalist sites like yahoo, they
don't apply to sites like ours .. which are a collection of disparate interest
groups with micromanaged traffic -- in fact our sites are really
more like Usenet newsgroups/email than the web
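the 'not smooth' point can be sketched with a little arithmetic. assume (hypothetically) two groups of users, generalists and hobbyists, and a panel whose mix of the two differs from the real population. every number below is invented for illustration.

```python
def audience_share(group_shares, visit_rates):
    """Overall fraction of users visiting a site.

    Computed as the sum over groups of share[g] * visit_rate[g].
    """
    if abs(sum(group_shares.values()) - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    return sum(share * visit_rates[group]
               for group, share in group_shares.items())

# hypothetical chance that a member of each group visits the niche site
visit_rates = {"generalists": 0.001, "hobbyists": 0.40}

# hypothetical true mix of users vs. the mix inside the ratings panel
population = {"generalists": 0.98, "hobbyists": 0.02}
panel = {"generalists": 0.995, "hobbyists": 0.005}

true_share = audience_share(population, visit_rates)
panel_share = audience_share(panel, visit_rates)

print(f"true audience share:     {true_share:.4%}")
print(f"panel-based estimate:    {panel_share:.4%}")
print(f"estimate / truth ratio:  {panel_share / true_share:.2f}")
```

with a smooth population the panel mix would hardly matter. with lumpy interest groups, a panel that is a bit light on one small group can be off by a large factor for the sites that group cares about, and a bigger panel with the same mix does not fix it.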
the Web itself is not a smooth
homogeneous structure, it has blobs and tendrils and a rich structure --
almost biological in complexity. there are huge sites inside corporations and
universities that transfer petabytes of data daily, and are not
even mentioned by the net ratings gods
so, you have a fragmented medium and a fragmented user base...
you have little chance of estimating viewership from server logs
what works is microtargeting, as we well know
but, all this is just not relevant.. the web is a fine thing, but the killer
app is email
the web is a broadcast medium where the cost of entering the market is
near zero so... you will eventually have as many websites as users
e.g. the number of porn sites is growing faster than the
number of porn viewers
heehee... take that and put it in your models.
old broadcast models developed for tv and radio need not apply
email is not a broadcast medium (when it is misused as one
we call it spamming). it is targeted, contextual and many times more effective
than a website
email, if you think about it, is the logical extension of the
fragmentation of the web, except that it came first!
(gets off soapbox)
so, i think i have expanded on points 1, 2 ...
i leave you to deal with 3...
cookies,
java and udder attempts to track actual users, and so on....