Parsing for Pollution


This post is really just a graph of the air pollution we see around the world, measured by the Air Quality Index (AQI).

The site is structured very cleanly, which makes the data easy to parse. I am using Python 3.6
in conjunction with Requests and BeautifulSoup to get AQI values for the 300 most populous cities in the world.

An interesting point to note is that several cities still do not publish pollution indices. It is mind-boggling that governments around the world are taking air pollution so lightly.

Importing this data into Tableau, we get some rather startling graphs.

Note how India is higher in this index than China. I would have thought that, with so much manufacturing in China, pollution levels would be higher there.

Delhi is really off the charts! This level of air pollution means a lot of lung diseases are developing in this very populous city.

The script I am using to get this data is here:

---------------------------------------------START SCRIPT--------------------------------------------------

"""
This script does the following:
1. Reads from a file all the URLs associated with the 300 most populous cities around the world
2. Parses the contents of each URL using Requests and BeautifulSoup
    2a. Make sure to strip the end-of-line character before parsing the URL passed in as a line
    2b. Use a try-except sequence because some of the URLs will not exist
3. Uses re to get the city value and prints it with the score, for example:
delhi 599
beijing 78
newyork 28

This is a list of cities with AQI scores. It can be used to analyse the data in BI tools.

Further enhancements
I. We could use pandas to output.
II. We could output to a specified text file.
III. I will use Django at some point and publish live scores online.
"""

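from bs4 import BeautifulSoup
import requests
import re

# Open the input file of city URLs, one URL per line
filename = 'cities.txt'
with open(filename, "r") as file:
    for line in file:
        # Strip the end-of-line character before using the line as a URL
        line = line.rstrip()
        # The city name is whatever follows "city/" in the URL
        city = re.sub(".*city/", "", line)
        try:
            response = requests.get(line)
            soup = BeautifulSoup(response.content, 'html.parser')

            # The AQI value is in the "aqiwgtvalue" element inside "citydivouter"
            aqi_id = soup.find(id="citydivouter")
            aqi_score = aqi_id.find(id="aqiwgtvalue")

            print(city + " " + aqi_score.get_text().strip())
        except Exception:
            # Some URLs do not exist, or the page has no AQI value
            print(city + " NA")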
--------------------------------------------END SCRIPT---------------------------------------------------
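
Enhancement II from the script's notes (output to a specified text file) could look roughly like the sketch below. This is not part of the script above; the function name write_scores, the tab-separated layout and the "city"/"aqi" header are my own assumptions, and it uses only the standard csv module. A score of None is written as NA, matching what the script prints for a failed lookup.

```python
import csv


def write_scores(rows, path):
    """Write (city, score) pairs to a tab-separated text file.

    rows: iterable of (city, score) tuples; score may be None.
    path: name of the output file (hypothetical, chosen by the caller).
    """
    with open(path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        # Header row so BI tools can name the columns (assumed layout)
        writer.writerow(["city", "aqi"])
        for city, score in rows:
            # Mirror the script's "NA" marker for cities without a score
            writer.writerow([city, "NA" if score is None else score])
```

Calling write_scores([("delhi", 599), ("beijing", 78), ("newyork", 28)], "aqi_scores.txt") would write a header line followed by one tab-separated line per city, which a BI tool should be able to import directly.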

The text file contains one URL per line, each ending in city/ followed by the city name.
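
As a rough illustration of how a line from that file becomes a city name, here is a small sketch built on the same re.sub call the script uses. The example.com addresses are placeholders and not the real site; the helper name city_from_url and the extra stripping of a trailing slash are my own additions.

```python
import re


def city_from_url(url):
    """Return whatever follows "city/" in the URL.

    Surrounding whitespace (including the end-of-line character) and a
    trailing slash are removed; the trailing-slash handling is an assumption.
    """
    return re.sub(r".*city/", "", url.strip()).rstrip("/")


# Placeholder URLs: example.com stands in for the real site
for url in ["https://example.com/city/delhi\n",
            "https://example.com/city/beijing/"]:
    print(city_from_url(url))
```

The pattern is greedy, so everything up to and including the last city/ in the line is dropped and only the city name is left.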


