Before we start collecting data, we should also draft a first version of our data dictionary. A data dictionary lists the variables we plan to collect along with their definition, format, and source. While it sounds like a trivial housekeeping task at first, drafting a data dictionary early on is a key tool for avoiding confusion and keeping your goal in sight. In the following, we will go through the advantages of keeping a data dictionary and its key elements.
Why you should maintain a data dictionary
A data dictionary is one of the most valuable artifacts throughout the data collection and analysis process. It has three main advantages. First, a data dictionary serves as a checklist for the data collection process. It tells us which data needs to be collected, from where, and in what format it needs to be available for the analysis. Second, a data dictionary serves as a communication device. When several people are involved in the research process (e.g., research assistants, senior professors), the data dictionary helps answer questions like “what else do we have?” or “what does this variable mean?”. It is a document that everybody can reference, and it makes onboarding new researchers easier. Finally, many scientific journals require authors to provide their data upon publication, along with an explanation of how the data can be used to replicate the findings. A data dictionary is a key input for ensuring replicability.
Here’s a data dictionary for an example research project.
|Variable||Definition||Example||Format||Source|
|price||Price of the fruit in US$||5.99||Double||Fruit facts database|
|name||Name of the fruit||Mango||String||Fruit master directory|
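To make the table above concrete, here is a minimal sketch of how the same two entries could be kept in machine-readable form and exported as CSV. The class name `DictionaryEntry`, the helper `to_csv`, and the lowercase field names are assumptions made for illustration, not part of the original project. The price format is written as Double here, the floating-point type from the format list in this section.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class DictionaryEntry:
    """One row of the data dictionary (field names are illustrative)."""
    variable: str
    definition: str
    example: str
    format: str
    source: str


# The two entries from the example table.
DATA_DICTIONARY = [
    DictionaryEntry("price", "Price of the fruit in US$", "5.99",
                    "Double", "Fruit facts database"),
    DictionaryEntry("name", "Name of the fruit", "Mango",
                    "String", "Fruit master directory"),
]


def to_csv(entries):
    """Serialize the entries to CSV text so the dictionary can be shared."""
    buffer = io.StringIO()
    column_names = [field.name for field in fields(DictionaryEntry)]
    writer = csv.DictWriter(buffer, fieldnames=column_names, lineterminator="\n")
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()


print(to_csv(DATA_DICTIONARY))
```

Keeping the dictionary in a structured file rather than only in a text document means the crawler, the scrapers, and the analysis scripts can all read the same definitions.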
The first column names the piece of data, which we call a variable. Variable names should be one word. Don’t use spaces, special characters, or uppercase letters, since many tools and programming languages handle them inconsistently. The variable name is later used in our crawlers and scrapers, as well as in the statistics software for the analysis. Choose names that describe precisely what the variable contains.
The second column provides a definition. Definitions should be short and unambiguous. If a formula is clearer than words, show the formula behind the variable rather than writing out a complex definition.
The third column provides an easy-to-understand example of what an actual data record looks like.
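The naming rules above can be checked automatically before a name goes into the dictionary. The following is a minimal sketch under one stated assumption: a valid name starts with a lowercase letter and contains only lowercase letters, digits, and underscores. Allowing underscores and digits is our own choice for illustration, and the function name `is_valid_variable_name` is hypothetical.

```python
import re

# Assumed convention: a lowercase letter first, then lowercase letters,
# digits, or underscores. No spaces, no special characters, no uppercase.
VALID_NAME = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_variable_name(name: str) -> bool:
    """Return True if the whole name follows the assumed convention."""
    return VALID_NAME.fullmatch(name) is not None


for candidate in ["price", "fruit_name", "Fruit Name", "price$", "2nd_price"]:
    status = "ok" if is_valid_variable_name(candidate) else "rename"
    print(f"{candidate!r}: {status}")
```

Running such a check when the dictionary is drafted is cheaper than renaming variables after the scrapers and analysis scripts already depend on them.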
The fourth column states the format of the variable. The format tells us whether the variable is, for example, a number or text. Being precise about the format matters because it determines which statistical operations are valid and how accurately values are stored. There are typically five different format types:
- String: text (e.g., “car”, “house”)
- Integer: whole numbers (e.g., “1”, “2”, “100”)
- Double: double-precision floating-point numbers (e.g., “1.00002”, “123.98”)
- Date: a calendar date (e.g., “January 1, 2019”, “12-08-19”)
- Boolean: binary (i.e., “0” or “1”, sometimes also “true” or “false”)
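The five format types can also be used to check collected values against the dictionary. The sketch below maps each format name to a parser and reports whether a raw text value can be read in that format. The function names, the `PARSERS` table, and the choice of the ISO date pattern `YYYY-MM-DD` are assumptions for illustration; a real project would settle on its own date convention and record it in the dictionary.

```python
from datetime import datetime


def parse_boolean(text: str) -> bool:
    """Accept the spellings "0"/"1" and "true"/"false" (case-insensitive)."""
    mapping = {"0": False, "1": True, "false": False, "true": True}
    key = text.strip().lower()
    if key not in mapping:
        raise ValueError(f"not a boolean: {text!r}")
    return mapping[key]


def parse_date(text: str):
    """Assumed convention: ISO dates such as 2019-01-01."""
    return datetime.strptime(text, "%Y-%m-%d").date()


# One parser per format type named in the list above.
PARSERS = {
    "String": str,
    "Integer": int,
    "Double": float,
    "Date": parse_date,
    "Boolean": parse_boolean,
}


def matches_format(raw_value: str, format_name: str) -> bool:
    """Return True if raw_value can be parsed as the declared format."""
    parser = PARSERS[format_name]  # an unknown format name raises KeyError
    try:
        parser(raw_value)
    except ValueError:
        return False
    return True


print(matches_format("5.99", "Double"))
print(matches_format("5.99", "Integer"))
print(matches_format("2019-01-01", "Date"))
```

A check like this also shows why the format matters: a price such as 5.99 is readable as a Double but not as an Integer, so declaring the wrong format would either reject the value or lose its decimals.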
The fifth column gives the data source. The data source denotes the origin of a variable, for example a database that we purchased or a web page that we have crawled. Maintaining the source is important because it helps us keep track of where the data comes from and eases replication and verification checks.
You may also want to add further information, such as references to papers that have used a similar variable or whether the variable is an independent, dependent, or control variable in your model. You may also record when and how the data was extracted from the particular source.
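Because every variable carries its source, the dictionary can be turned into a collection checklist that shows which variables to fetch from which origin. Here is a minimal sketch using the two entries from the example table; the function name `variables_by_source` and the tuple layout are assumptions for illustration.

```python
from collections import defaultdict

# (variable, source) pairs taken from the example data dictionary.
ENTRIES = [
    ("price", "Fruit facts database"),
    ("name", "Fruit master directory"),
]


def variables_by_source(entries):
    """Group variable names under the source they must be collected from."""
    grouped = defaultdict(list)
    for variable, source in entries:
        grouped[source].append(variable)
    return dict(grouped)


for source, variables in variables_by_source(ENTRIES).items():
    print(f"{source}: {', '.join(variables)}")
```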