A key problem for researchers interested in deriving insights from data is bringing raw data into a format that can be used by tools such as Python or R. This step often creates headaches. Raw data comes in so many different shapes and formats that we almost always have to get our hands dirty.
In this blog post, we will be dealing with a format called JSON (JavaScript Object Notation). We will first analyze the structure of the JSON format. Second, we will fillet JSON data step by step to make it usable. For all of this we will use Python.
What is JSON?
One popular format for raw data is JSON. For example, data retrieved from APIs or exported from NoSQL databases (e.g., MongoDB) is typically JSON.
What’s special about JSON?
At first sight, JSON data looks like spaghetti. Here’s an example taken from data on software repositories. The data contains the name of a software repository (repo_name) and a list of the programming languages used in the repository, together with the number of bytes for each language.
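To make the shape concrete, a record with this structure might look like the following minimal sketch. The repository name, languages, and byte counts are invented for illustration. The keys repo_name, language, and name follow the description above, while the key bytes is an assumed name for the byte count.

```python
import json

# Hypothetical record: all values below are invented for illustration.
# The key "bytes" is an assumed name for the byte count.
raw = '{"repo_name": "example-user/example-repo", "language": [{"name": "Python", "bytes": "1500"}, {"name": "Shell", "bytes": "200"}]}'

# json.loads parses a JSON string: objects become dicts, arrays become lists
record = json.loads(raw)

print(type(record).__name__)              # dict
print(type(record["language"]).__name__)  # list
print(record["repo_name"])
```

Written on a single line like this, even a small record is hard to read, which is why the next step is to make the structure visible.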
How to extract data from JSON?
Let’s get the data from this JSON piece into Python. Here’s my proven three-step procedure.
To get a handle on the spaghetti-structure of JSON, use some tools to visualize its structure. Once you understand the structure, you can more easily decide on what you need and how to extract it. To make it pretty, there are two approaches:
- JSON-Formatter: a little webpage where you can paste your JSON spaghetti and get back a visualized tree structure of the data. Works particularly well for longer JSON sets.
- Print it on the console: run print(json.dumps(parsed_json, indent=4, sort_keys=False)) in Python, and the formatted JSON is printed on the console.
See, our example from above has become more readable:
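The console approach can be sketched as follows. The record is a hypothetical one with invented values, and the key bytes is an assumed name; only the json.dumps call itself comes from the text above.

```python
import json

# Hypothetical one-line record; all values are invented for illustration.
raw = '{"repo_name": "example-user/example-repo", "language": [{"name": "Python", "bytes": "1500"}]}'

parsed_json = json.loads(raw)

# indent=4 puts every element on its own line, indented by nesting level;
# sort_keys=False keeps the keys in their original order.
pretty = json.dumps(parsed_json, indent=4, sort_keys=False)
print(pretty)
```

In the printed version, each key sits on its own line and the entries of the language list are indented one level deeper than repo_name, so the nesting is visible at a glance.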
Below is the code I use to read the JSON file and access the data.
Let’s continue with the above example. Say we want to extract the programming languages used in the project. Then we would first read in the entire JSON file. Second, we would iterate over all language elements in the JSON file and write them into a variable.
    import json

    # l is assumed to hold one JSON record as a string (e.g., one line of the file)
    json_dta = json.loads(l)  # parse the JSON string into Python objects
    if "language" in json_dta:  # check that the element exists
        for lang_element in json_dta["language"]:
            lang = lang_element["name"]  # write the element into a variable
            # do something with the data
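Putting the pieces together, here is a minimal sketch that collects the languages of several records. It assumes the file holds one JSON record per line, which the original snippet suggests but does not state. The function name extract_languages and the sample records are hypothetical.

```python
import json

def extract_languages(lines):
    """Return (repo_name, language_name) pairs from JSON records, one per line."""
    pairs = []
    for l in lines:
        l = l.strip()
        if not l:
            continue  # skip blank lines
        json_dta = json.loads(l)
        # Fall back to an empty list when "language" is missing or null
        for lang_element in json_dta.get("language") or []:
            pairs.append((json_dta.get("repo_name"), lang_element["name"]))
    return pairs

# Hypothetical sample lines with invented values
sample = [
    '{"repo_name": "example-user/repo-a", "language": [{"name": "Python"}, {"name": "R"}]}',
    '{"repo_name": "example-user/repo-b"}',
]
result = extract_languages(sample)
print(result)  # [('example-user/repo-a', 'Python'), ('example-user/repo-a', 'R')]
```

Because the function accepts any iterable of strings, an open file object can be passed in directly, and each line is then parsed as it is read. The second sample record has no language element and is skipped without raising an error.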