1 – Answer
The short answer is around 6%, or about 1,049 domains, as of 23 April 2016.
2 – Introduction
The simple question used as the title of this article raises an interesting problem that mixes experimental analysis, statistics and computer science. We will detail how to obtain the percentage of developed websites for the specific case of LLL.in domains, but the method can be applied to any category of domains.
First of all, we are working on 26 x 26 x 26 = 17,576 domains, a number that is too big to be analysed manually, i.e. by looking at each domain one by one. We need to find alternative ways to analyse them.
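As a minimal sketch (Python, standard library only), the full list of candidate names can be generated rather than typed. The function name `lll_domains` is ours and only illustrative.

```python
from itertools import product
from string import ascii_lowercase


def lll_domains(tld="in"):
    """Return every three-letter name under the given TLD, from aaa to zzz."""
    return ["".join(letters) + "." + tld
            for letters in product(ascii_lowercase, repeat=3)]


domains = lll_domains()
print(len(domains))  # number of LLL.in names, i.e. 26 ** 3
```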
3 – Identify parked domains
To reduce the number of domains to analyse, we can start by removing parked domains from the initial set of 17,576. The first step is to retrieve the nameservers of each domain, since these identify parked domains: their nameservers belong to well-known parking companies, for instance NS1.SEDOPARKING.COM, NS1.PARKINGCREW.NET or NS1.BODIS.COM. To list the nameservers of each LLL.in domain, we can either run a ‘whois’ query on each of the 17,576 domains (the long, complex and most precise way), or ask a domain name server directly for the ‘Start Of Authority’ (SOA) record of each zone (the fast, simple, and somewhat sketchy way). More info on the zone file and the SOA record here: Zone file.
For LLL.in, we performed this analysis on 23 April 2016 using the ‘whois’ approach, and we obtained the following result, aggregated for the 15 most frequent nameserver values:
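Once the nameservers are collected, the parked/not-parked decision is a simple string match. The sketch below is only an illustration, not the tool we used: it assumes the nameservers have already been retrieved, it knows only the three parking companies quoted above, and the names `PARKING_MARKERS` and `is_parked` are ours.

```python
# Markers taken from the nameserver examples quoted above; a real run
# would need a longer, hand-maintained list of parking companies.
PARKING_MARKERS = ("sedoparking.com", "parkingcrew.net", "bodis.com")


def is_parked(nameservers):
    """Return True if any nameserver belongs to a known parking company."""
    for ns in nameservers:
        # Normalise case and drop the optional trailing dot of a DNS name.
        host = ns.strip().rstrip(".").lower()
        if any(host == m or host.endswith("." + m) for m in PARKING_MARKERS):
            return True
    return False


# Made-up input, for illustration only: domain -> list of its nameservers.
ns_by_domain = {
    "aaa.in": ["NS1.SEDOPARKING.COM", "NS2.SEDOPARKING.COM"],
    "aab.in": ["ns1.example.net."],
}
parked = [d for d, ns in ns_by_domain.items() if is_parked(ns)]
print(parked)
```

Matching on the end of the host name, rather than on any substring, avoids counting an unrelated name that merely contains a parking company's name.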
|Rank||Number of LLL.in||Nameserver value||Parking company?|
Using this first result, we can sum the number of domains parked at the main parking companies (rows with ‘yes’ in the last column of the above table): 2,044 + 1,824 + 1,818 + 1,669 + 861 + 663 + 237 = 9,116. This means that at least 9,116 LLL.in domains, or 51.9% of them, were parked on 23 April 2016.
The other domains may be parked at smaller or unknown companies, may not resolve, may redirect elsewhere, or may be developed websites. This last category includes domains under construction, domains with a single simple or empty page, and truly developed domains. It is this last sub-category that interests us in the next parts.
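The arithmetic can be checked in a few lines of Python; the variable names are ours.

```python
TOTAL = 26 ** 3  # 17,576 LLL.in domains

# Domain counts for the seven parking companies, as summed in the text.
parked_counts = [2044, 1824, 1818, 1669, 861, 663, 237]

parked = sum(parked_counts)
share = parked / TOTAL
print(parked, f"{share:.1%}")  # 9116 51.9%
```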
4 – Determine a representative sample
We are now left with 17,576 – 9,116 = 8,460 domains to analyse. Again, this number is too big: we can’t look at each domain to check whether it resolves, is parked, is developed, and so on. So, in this second step, we will use a statistical sample to estimate, with a known precision, the number of developed domains. In other words, we will manually look at a small group of domains and extrapolate the result to the whole group. The ‘small’ group still needs to be ‘big enough’ for the result to be meaningful. For more details, you can have a look at the following pages: Sample size determination and Sample size calculator.
Let’s say we want to estimate the number of developed domains with a margin of error of 3% and a confidence level of 95%. In this case, we need to look at 948 LLL.in domains chosen randomly within our group of 8,460 non-parked domains.
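For readers who want to reproduce the figure of 948, here is a sketch of the usual sample size formula for a proportion, with the finite population correction. We assume the worst case p = 0.5, which is what online calculators typically use; the function names and the seed are ours.

```python
import math
import random
from statistics import NormalDist


def sample_size(population, margin, confidence=0.95, p=0.5):
    """Sample size for estimating a proportion in a finite population."""
    # Two-sided z-score, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Size needed for an infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction, then round up.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def draw_sample(domains, n, seed=None):
    """Pick n distinct domains at random (the seed makes the draw repeatable)."""
    return random.Random(seed).sample(list(domains), n)


n = sample_size(8460, 0.03)
print(n)  # 948
```

A call such as `draw_sample(non_parked_domains, n, seed=2016)` would then give the list of domains to review, where `non_parked_domains` stands for the 8,460 names left after part 3.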
5 – Bulk screenshot of domains
Again, looking at 948 domains manually is not an easy task, since we would have to open a URL and wait for the page to load 948 times. Instead, we created a small tool that takes screenshots of websites from a list of domains, using the Selenium tool and the PhantomJS webdriver. More info here: http://stackoverflow.com/questions/18067021/how-do-i-generate-a-png-file-w-selenium-phantomjs-from-a-string. This way, our final task is only to review 948 pictures to determine which domains are really developed.
In our analysis of the 948 randomly chosen domains, 118, or 12.4%, were developed websites. Thus, we can extrapolate that 12.4% of our initial group of 8,460 domains are developed websites. That corresponds to 1,049 developed websites out of 8,460.
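The sketch below shows the general idea of such a tool; it is not our original script. It assumes a recent Selenium release with headless Chrome in place of PhantomJS, which newer Selenium versions no longer support. The function names, window size, timeout and output folder are arbitrary choices of ours.

```python
import os


def screenshot_path(domain, out_dir):
    """Build the PNG path for a domain, e.g. shots/abc.in.png."""
    return os.path.join(out_dir, domain.lower() + ".png")


def capture_all(domains, out_dir="shots", timeout=20):
    """Save one screenshot per domain; return the domains that failed to load."""
    # Imported here so that the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException

    os.makedirs(out_dir, exist_ok=True)
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    driver.set_window_size(1280, 800)
    driver.set_page_load_timeout(timeout)   # seconds before giving up on a page

    failed = []
    try:
        for domain in domains:
            try:
                driver.get("http://" + domain)
                driver.save_screenshot(screenshot_path(domain, out_dir))
            except WebDriverException:
                # Covers timeouts and names that do not resolve.
                failed.append(domain)
    finally:
        driver.quit()
    return failed
```

Calling `capture_all(["abc.in", "xyz.in"])` would write `shots/abc.in.png` and `shots/xyz.in.png`, and return the names that could not be loaded, which is a useful list in itself since those domains do not resolve or time out.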
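The extrapolation can be written out as follows. This is a sketch with variable names of our own; the range simply applies the 3% margin chosen in part 4 to the rounded rate, and should be read as a rough bracket rather than an exact interval.

```python
sampled = 948            # screenshots reviewed
developed_in_sample = 118
non_parked = 8460        # domains left after removing parked ones
margin = 0.03            # margin of error chosen in part 4

rate = round(developed_in_sample / sampled, 3)   # 0.124, i.e. 12.4%
estimate = round(rate * non_parked)

# Rough bracket: rate minus and plus the margin, applied to the group.
low = round((rate - margin) * non_parked)
high = round((rate + margin) * non_parked)
print(estimate, low, high)
```

Using the unrounded ratio 118 / 948 instead of 12.4% gives about 1,053 rather than 1,049; the difference is far smaller than the margin of error.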
6 – Conclusion
The last step of our study is to combine the result obtained in part 5 (12.4% of the non-parked domains are developed) with the result of part 3 (9,116 domains are parked). We can now conclude that 1,049 of the 17,576 LLL.in domains are developed. This represents approximately 6%.
We are neither IT experts nor mathematicians. Feel free to report any error in our study. All pictures and software parts are available on request.
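Putting the figures of parts 3 and 5 side by side gives the overall breakdown. This is a small sketch; "other" is our label for everything that is neither parked at a main parking company nor developed (small parking companies, non-resolving names, redirects, and so on).

```python
TOTAL = 17576
parked = 9116       # part 3
developed = 1049    # part 5
other = TOTAL - parked - developed

for label, count in [("parked", parked), ("developed", developed), ("other", other)]:
    # Name, count with thousands separator, share of all LLL.in domains.
    print(f"{label:10s}{count:>7,d}{count / TOTAL:>8.1%}")
```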