How to find the data you’re looking for

One particular field of journalism is data journalism. Simon Rogers, Data Editor at Twitter, former editor of the Guardian’s award-winning Datablog and an instructor on the free online data-driven journalism course, describes data journalism as a way of telling stories with numbers. It brings stories that are in the public eye to life by showing the numbers behind the news. The data can be accompanied by visualizations, but these are only there in service of the story.

Reporting on data has been part of journalism for as long as journalism has existed. In the past, data was often collected with a notebook and a cassette recorder, and journalists often had to rely solely on the research and analysis performed by statisticians. Over the years the techniques of data journalism have changed. Journalists now have much easier access to tools that help them gather data, such as Excel and Numbers, and to tools that help them visualize it. In the digital age, open data has also become far more widespread. Governments and other organizations that collect statistics around the world are publishing thousands of databases online, which has made it both easier and harder for journalists to find the data they are looking for. Easier, because journalists can now search the Internet for the information they need. Harder, because with so many datasets available to journalists and the public, it is more difficult to find the ‘perfect’ dataset. By the ‘perfect’ dataset I mean one that not only offers the data you need to accompany your story, but is also valid. This blog post offers you, as journalists, guidelines on how to find this ‘perfect’ dataset yourselves.


How can you get data to support your story?

Quote: “Data journalism begins in one of two ways: either you have a question that needs data, or a dataset that needs questioning. Whichever it is, the compilation of data is what defines it as an act of data journalism”. – Paul Bradshaw

Paul Bradshaw is the Head of the Online Journalism MA at Birmingham City University, Visiting Professor at City University’s School of Journalism in London and also an instructor on the online data journalism course. His quote shows that a story can start either from a question for which you need to find data, or from a dataset that raises an interesting question of its own. This blog post focuses on the first situation: you have a certain topic in mind and are looking for data to accompany your story. The first thing to do is ask yourself ‘What kind of data am I looking for?’. Once you know what you are looking for, you can start searching for it.

Where can you find data?

Of course you can collect your data by doing your own research, but in most cases you will probably not have the time or money to do that. Therefore, a quicker way to gather data would be to look for it online. As I have mentioned before, more organizations are publishing their data online. You can, for instance, go to a government website or the website of a national statistical service and find all sorts of data there. On this Wikipedia page you can find a list of national and international statistical services you could use to gather data. You can also look for information on the websites of international bodies, e.g. the website of the World Health Organization, the United Nations, the World Bank or the European Union.

How do you know if your dataset is valid?

When you have found a dataset, you need to make sure that the data is valid and does indeed support your story. So how do you know if your data is trustworthy? Rogers states that when you are relying on data collected by someone else, you need to check who collected it, and when and how it was collected. Get in touch with the person who collected the data and ask them about it. In addition, try to find another source that has the same kind of data and compare that dataset with the one you found. These two steps are very important in determining whether your data is valid.

Take, for instance, this example described by TechTarget, which shows how the analysis in big data projects can go wrong. In this project, researchers wanted to use Twitter feeds and other social media to predict the unemployment rate in the United States. They looked for words that pertained to unemployment, e.g. jobs, unemployment and classifieds, in tweets and posts on other social media. They then looked for correlations between the number of words per month in this category and the unemployment rate for that month. During the project there was a sudden increase in the word count, so the researchers believed they were on to something. What they failed to notice was that Steve Jobs died in the same period in which they found the increase. The number of tweets with ‘jobs’ in them was therefore higher, but for a reason that had nothing to do with unemployment. If the researchers had looked more closely at what was happening during the time of their research, they would have known that the increase in words was unrelated to the unemployment rate.

So it is important for you, as a journalist, to be aware that not all research is accurate and trustworthy. If you find another dataset that says the same thing, the chances that you have found a good, trustworthy dataset are higher. Furthermore, you need to be aware of how you interpret the data. Most mistakes in data analysis come from misinterpreting the data. Look carefully at what the data is actually saying and not just at what you want or believe it is saying.
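The keyword problem in the unemployment example can be made concrete with a small Python sketch. The posts below are invented for illustration and are not real tweets, and the function names are hypothetical. The sketch does not reproduce the researchers’ actual method; it only shows how a naive keyword count picks up posts about Steve Jobs, and how reading the matched posts reveals the problem.

```python
import re

# Invented example posts, for illustration only (not real tweets).
posts = [
    "Lost my job last week, checking the classifieds every day",
    "Unemployment office was packed this morning",
    "RIP Steve Jobs, thank you for the iPhone",
    "Steve Jobs changed how we use computers",
    "Any jobs going in Leeds? Been looking for months",
]

# The three example keywords named in the text above.
KEYWORDS = ("jobs", "unemployment", "classifieds")


def mentions_keyword(text):
    """Return True if the post contains a tracked keyword as a whole word."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(keyword in words for keyword in KEYWORDS)


def mentions_steve_jobs(text):
    """Return True if 'jobs' appears as part of the name 'Steve Jobs'."""
    return re.search(r"\bsteve\s+jobs\b", text, flags=re.IGNORECASE) is not None


# Naive count: every post containing a keyword is treated as being
# about unemployment.
naive_matches = [p for p in posts if mentions_keyword(p)]

# After reading the matches, drop the ones that are clearly about the person.
filtered_matches = [p for p in naive_matches if not mentions_steve_jobs(p)]

print(len(naive_matches), "posts match a keyword")
print(len(filtered_matches), "remain after removing 'Steve Jobs' posts")
for post in naive_matches:
    print(" -", post)
```

Printing the matched posts, rather than only the totals, is the important part: a count on its own would not have shown why the number went up.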


Keep the five W’s in mind

The most important things to remember when you are trying to see if your dataset is valid are the five W’s, as described by Simon Rogers. Ask yourself these questions before you use the dataset you found.

  • Who: Where did the data come from?
  • What: What are you trying to say with your data?
  • When: How old is your data?
  • Where: Which situation is described by the collected data? An essential part of data journalism is combining different datasets to create a new story. Simon Rogers has, for instance, combined data on gun ownership and homicides around the world into a single supporting visual.
  • Why: Why is the data you found interesting and what does the data mean?
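One way to keep track of the five W’s while you work is to write the answers down next to the dataset. The Python sketch below is only an illustration of that idea: the class name `DatasetNotes`, its fields and the example values are assumptions made for this sketch and do not come from Simon Rogers.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetNotes:
    """Answers to the five W's for one dataset (hypothetical structure)."""
    who: str     # where the data came from
    what: str    # what you are trying to say with the data
    when: date   # when the data was collected
    where: str   # which situation the data describes
    why: str     # why the data is interesting and what it means

    def missing(self):
        """Return the names of the questions that are still unanswered."""
        text_fields = ("who", "what", "where", "why")
        return [name for name in text_fields if not getattr(self, name).strip()]

    def age_in_years(self, today):
        """Return the approximate age of the data in years on a given date."""
        return (today - self.when).days / 365.25


# Placeholder values, for illustration only.
notes = DatasetNotes(
    who="",  # source not traced yet
    what="Compare two trends over time",
    when=date(2010, 1, 1),
    where="National figures for one country",
    why="Tests a claim made in the news",
)

print("Still unanswered:", notes.missing())
print("Age in years:", round(notes.age_in_years(date(2014, 11, 10)), 1))
```

An empty answer, such as the missing source here, is a sign that the dataset still needs checking before you use it.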


In conclusion, the ‘perfect’ dataset offers you the data you are looking for, can accompany your story and is also valid. This blog post has shown you how to find this dataset and how to determine whether it is valid. To summarize, you need to check who collected the data you found, and when and how it was collected. Get in touch with that person and ask them questions about their data. When you have found a dataset that could support your story, be aware that not all data is accurate and trustworthy. Try to find another source with the same kind of data. The chances that your dataset is trustworthy are higher when you have another source that says the same thing. If you want your data to be valid, always keep the five W’s in mind. They offer guidelines that can help you determine whether the data can be trusted.



6 thoughts on “How to find the data you’re looking for”

  1. I certainly agree with your point that datasets can easily be misinterpreted, which leads to information with wrong conclusions or context. You also give great tips for journalists on how to deal with this. However, I’m wondering what you think of journalists who don’t have any experience with datasets. Their datasets may well be valid, but since they might have no experience at all with, for instance, statistics, they will still misinterpret the data. Do you think there is any solution to this problem?


    Posted by Maaike's blog | 11 November 2014, 10:58
  2. I think the 5 W questions are indeed a good guideline to ask some critical questions about any acquired data. But when it comes to the verification process, I can imagine, however, that in some cases finding out where some data came from might be more difficult compared to other types of information. Who should be contacted if you happen to have only some numbers in a spreadsheet? Maybe the model needs to be a bit more adjusted for online sources (as this feels a bit copied for the verification of offline sources). I’m not sure yet what should be changed, but it’s something to think about!

    In response to Maaike: if journalists don’t have time or are not willing to get into statistics, I think we could see a growing trend in the future of statisticians being asked for journalistic projects who are able to contribute to conclusions that are based on the data.


    Posted by mmvsedy | 11 November 2014, 16:35
    • I agree that when you’re dealing with online datasets, it can be more difficult to find out who originally published the data. However, I don’t think it is impossible and I don’t think there is that big a difference between online and offline datasets in this case. You can always trace the data back to the original source and get in touch with the people behind that website or organization, for example. And if you cannot trace the data back to any source, I would recommend using another dataset. If you cannot find any source that talks about this data, it probably wasn’t that reliable to begin with.


      Posted by reflectionsonthewrittenword | 12 November 2014, 00:08
  3. In your post you say something about different sources of data, and you give government websites and the websites of health organisations as examples. I wondered whether you think this data is 100% reliable? I think there are health organisations for which it is important that the data shows exactly what they want it to show. I think this also applies to governments (I do not want to suggest Illuminati or something): there is data they really want to share with citizens, but there will also be data for which this is not the case.


    Posted by mennobroeders | 11 November 2014, 23:36
    • Good point. I agree that not all data published by governments or health organizations is completely reliable. I think it is important to always check your data. Even if you believe the source itself is probably reliable, it is important, like you said, to keep in mind that organizations might have other intentions with the data. My advice would be to always follow the guidelines I described, no matter where the data comes from, and to keep a critical attitude towards the data you find.


      Posted by reflectionsonthewrittenword | 12 November 2014, 00:14
