Nice to see they’re being so transparent about setting the record straight. Mistakes like this happen in the journalism business, but few outlets make a serious effort to correct misreporting beyond burying corrections where nobody sees them.
Unfortunately they don't discuss two of the biggest disparities I've seen in reporting on broadband internet access:
1) How do they define "Access"? Does it mean actual subscriptions? Does it mean the building/home is connected? Or does it mean a line passes the household, but there's actually no way to connect to it? (Look up New York City's lawsuit against Verizon's FIOS rollout.)
2) How do they define "Broadband"? In 2010 the FCC defined it as 4 Mbit/s down, 1 Mbit/s up. In 2015 they redefined it as 25 Mbit/s down, 3 Mbit/s up. Currently I have 50 Mbit/s down and 1 Mbit/s up, which Comcast absolutely defines as "Broadband", but it doesn't meet the FCC definition.
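Put as a quick check (just restating the FCC thresholds above, applied to the 50/1 connection I described):

```python
# FCC broadband thresholds: (download, upload) in Mbit/s.
FCC_2010 = (4, 1)
FCC_2015 = (25, 3)

def meets(defn, down_mbps, up_mbps):
    """True only if a connection meets both halves of a definition."""
    min_down, min_up = defn
    return down_mbps >= min_down and up_mbps >= min_up

print(meets(FCC_2010, 50, 1))  # True
print(meets(FCC_2015, 50, 1))  # False: 1 Mbit/s up < 3 Mbit/s required
```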
So if I'm reading this right, there are three data sources here:
1) US Census, which is based on surveying households "do you have broadband, Y/N"
2) FCC data, which is based on ISPs self-reporting (In a footnote the article says they're using Pai's new slower definition of broadband, 10Mbps, not the 2015 definition of 25Mbps.)
3) ASU/Iowa, which depends on a derived variable in commercially-purchased data which "denotes interest in ‘high tech’ products and/or services [including] personal computers and internet service providers" as a proxy for broadband ownership
...and the first two roughly match each other, while the third doesn't. The academics claim the company that sold them the data told them it was a reasonable proxy for broadband; the company says it didn't say that.
I just wanted to comment that I think this is brilliant, and exactly the kind of analysis and general skepticism toward data we should see more of.
Just for context, if it's not obvious: I work with data, both putting it together and analyzing it. And one of my chief frustrations with academia (and one of my biggest lessons to people I advise) is a kind of "cultural reverence for the data set".
Just because data was collected in no way assures that it's right or suitable, even if a reputable name says it is.
Be skeptics. Private suppliers have an incentive to sell you data. Private industries have incentives to keep data from you (it constitutes a competitive advantage). Government data is subject to political interference over what gets collected, even if you're lucky enough to live in a world where the actual collection is independent and rigorous. Responses to surveys and interviews may be inaccurate even when people think they're being honest, and on socially contentious topics you often don't even get that honesty.
And even if you manage to avoid all that, it doesn't mean your data isn't problematic. Our census in my country, for example, is done in the wintertime. How good is that at tracking seasonal towns?
Proper data collection is some of the hardest work you can do, and proper analysis means measuring, corroborating, justifying, and hypothesising about the data you have. It does not mean just calculating a stat or, god forbid, testing everything for statistical significance simply because it's in your data set.
For all those reasons, I highly commend this article. We need more of it.
I'm really confused by this: surely if you want to make a data set that bins Internet speed by county, the way to do it is as follows:
1. Go to a large Internet services provider (Amazon, Google, Akamai, Netflix).
2. Ask them to statistically sample the TCP flow rate observed in client traffic, by source IP address.
3. Get a data set that geolocates IP addresses to ZIP code (Amazon for example has this data).
4. Join the two.
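A minimal sketch of steps 2-4 (the flows.csv and geo.csv files and their columns are hypothetical; a real IP-geolocation join would match on prefixes/CIDR blocks rather than exact addresses):

```python
# Sketch of steps 2-4: join sampled per-client flow rates to an
# IP -> ZIP geolocation table, then summarize speeds per ZIP code.
import pandas as pd

# flows.csv: one row per sampled TCP flow -> client_ip, mbps
flows = pd.read_csv("flows.csv")

# geo.csv: geolocation table -> client_ip, zip_code
geo = pd.read_csv("geo.csv")

# Step 4: the join, then per-ZIP summary statistics.
joined = flows.merge(geo, on="client_ip", how="inner")
by_zip = (
    joined.groupby("zip_code")["mbps"]
          .agg(["median", "mean", "count"])
          .rename(columns={"count": "samples"})
)
print(by_zip.sort_values("median").head())
```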
Responsible adults; I approve.
Notice we applaud a careful post-mortem on a research report in exactly the same way we applaud a post-outage report.
Personally, there is something that I would like:
A monitor of exactly how much traffic is used by ads vs content.
Say I load a page that's just an article with text. What percentage of the bandwidth goes to the content I'm actually interested in versus the ads surrounding it?
The reason this number is important is mobile.
A user signs up for "3 gigs of data": how much of those 3 gigs is consumed by ads and shit they don't want/need?
Actually, it would be good to have a reporting standard where any given page states "this page weighs in at 50KB for content and 500KB for ads..."
Does this exist?
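I don't know of a standard, but you can approximate it from a browser HAR export (DevTools > Network > save as HAR). A minimal sketch, assuming a hypothetical page.har file and a toy ad-domain list standing in for a real blocklist like EasyList:

```python
# Split a page's transfer bytes into "ad" vs "content" by domain.
import json
from urllib.parse import urlparse

# Toy list; a real measurement would load a full blocklist.
AD_DOMAINS = {"doubleclick.net", "googlesyndication.com", "adnxs.com"}

def is_ad(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in AD_DOMAINS)

with open("page.har") as f:
    entries = json.load(f)["log"]["entries"]

ad_bytes = content_bytes = 0
for entry in entries:
    # bodySize is the standard HAR field; -1 means unknown.
    size = max(entry["response"].get("bodySize", 0), 0)
    if is_ad(entry["request"]["url"]):
        ad_bytes += size
    else:
        content_bytes += size

print(f"content: {content_bytes / 1024:.0f} KB, ads: {ad_bytes / 1024:.0f} KB")
```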
I don't remember which FCC action it was last year, but as I recall, the large providers are no longer required to show that they at least attempted to offer broadband to all households.
Previously they had to show this info to state and federal governments. Now they get to concentrate on providing access to the most profitable households while ignoring the less profitable ones.
Kudos to FiveThirtyEight on being transparent and analyzing what happened. But also...this was a series of mistakes, some of them pretty scary.
FiveThirtyEight's biggest mistake seems to be trusting an academic dataset when they had no idea how it was collected. That's understandable, especially since the data was published on Arizona State University's Center for Policy Informatics data portal. (You can go there right now and download the bad data; scroll to CATALIST DATA here https://policyinformatics.asu.edu/broadband-data-portal/data...) A university should be a trusted source. But FiveThirtyEight took an unbelievable outlier from this dataset and wrote an entire post about it (https://fivethirtyeight.com/features/lots-of-people-in-citie...). The dataset claims that only 29% of Washington D.C.'s adults have broadband. (The real number, according to the other datasets FiveThirtyEight looked at in the new post, is closer to 70%.) They even call out how extreme the Washington D.C. data point is on the article's histogram, as the only large county with such a low percentage. That should have been a clue to question the data.
What I find worse is that the academic researchers published this dataset. They bought behavioral marketing data and trusted a salesperson's claim that the variable HTIA (“Denotes interest in ‘high tech’ products and/or services as reported via Share Force. This would include personal computers and internet service providers. Blended with modeled data.”) was a good proxy for broadband access. To be clear, HTIA includes modeled data, which means they took demographics, voting records, and whatever other individual data they could grab (maybe they have records of your purchases, I'm just guessing) and predicted whether each adult in the US was interested in tech. This is the kind of data companies buy for ad campaigns, figuring that if they advertise to these adults, it might be better than random. There's no reason to think the aggregates of these numbers would be accurate or correctly calibrated, especially for an entirely different purpose (broadband vs. high tech).
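A toy simulation (entirely made-up numbers) of why that fails: a modeled score can correlate with broadband ownership, so it looks useful for ad targeting, and still produce a badly biased aggregate once it's thresholded; the gap below is roughly the D.C.-sized error described above.

```python
# Made-up numbers: a noisy "high tech interest" score that correlates
# with broadband can still yield a wildly wrong aggregate rate.
import random

random.seed(0)
n = 100_000
true_rate = 0.70  # suppose 70% of adults actually have broadband

has_broadband = [random.random() < true_rate for _ in range(n)]
# Vendor-style modeled score: a shifted, noisy function of the truth.
score = [0.4 * b + random.gauss(0.2, 0.25) for b in has_broadband]
# "Interested in high tech" flag = score above an arbitrary cutoff.
flagged = [s > 0.5 for s in score]

print(f"true broadband rate:   {sum(has_broadband) / n:.0%}")  # ~70%
print(f"aggregated proxy rate: {sum(flagged) / n:.0%}")        # ~49%
```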
It's disturbing that these sorts of datasets are floating around in academia, and it really makes you wonder what other bad data is being blindly trusted to write blog posts, research papers, and news articles.
Does anyone here know if there is a way to opt-out of being in the Catalist dataset?
A few things I don't like about this:
"After further reporting, we can no longer vouch for the academics’ data set. The preponderance of evidence we’ve collected has led us to conclude that it is fundamentally flawed.... The idea behind the stories was to demonstrate that broadband is not ubiquitous in the U.S. today, even as more of our lives and the economy go online. We stand by this sentiment and the on-the-ground reporting in the two stories even though we have lost confidence in the data set."
If the data you used to reach a conclusion is fundamentally flawed, it's pretty disingenuous to claim you stand by the sentiment. So they started with a conclusion, set out to prove it, later found the data they used to prove it was flawed, but still believe it's true.
The second thing I don't like is that readers seem very confused between access and usage, and the sloppy wording often conflates the two. It appears they were studying usage (actual subscriptions), not access (availability of a high-speed connection).
Lastly, they also seem to disregard an LTE wireless connection as usage of broadband, when I would have assumed it clearly counts. If LTE wireless is the more common form of broadband access in certain areas (e.g. rural areas where density can't justify running fiber, or dense metros where the LTE is so good there's no need for a wire), then it's not surprising you'll find broadband "usage" is low in those areas, even if those households are absolutely using broadband internet through an LTE hotspot.
Why does every title have to be clickbait nowadays? I can't help but feel it takes away from the legitimacy of a post when you have to use leading titles/clickbait.
I'm probably being too harsh but...
Good on them for writing this; it's important to admit when you're wrong. However, I feel this outlet has a larger responsibility to be actual data analysts as well as journalists (compared to, say, a more traditional journalist at a more traditional news outlet). So why was the analysis done in the postmortem article not done prior to publishing the original articles? A good analyst is one you can trust, trust is built by drawing conclusions from highly defensible data, and data is only highly defensible when it has undergone the analyst's severe scrutiny before conclusions are drawn, not after. Also, they should probably update the now-erroneous articles with a disclaimer indicating that much of the research is invalid.