Technical Report
Executive Summary
This project examines the relationship between air quality and various socio-economic and demographic indicators across U.S. states. These indicators include the percentage of adults aged 25 and older who did not complete high school, the Human Development Index assigned to each state, the number of people experiencing homelessness per 10,000 residents, the percentage of the homeless population that is unsheltered, and the Health, Education, and Income indices for each state.
We chose to measure air quality in terms of PM2.5: fine particulate matter with a diameter of 2.5 micrometers or less, emitted by sources such as vehicles, factories, and wood burning. State-level values were averaged from a random sample of 20 air-quality sensors in each state.
Project Context
Air pollution is a critical public health concern, with PM2.5 being particularly harmful due to its ability to penetrate deep into the lungs and bloodstream. Understanding whether and how air quality correlates with socio-economic factors such as education, income, and homelessness can help stakeholders such as public health agencies and policymakers prioritize resources and interventions.
Success criteria for our project include creating a Python package that successfully cleans data from our sources and produces a final dataset that can be analyzed both visually and through descriptive statistics.
Data Sources
- Primary dataset: OpenAQ – Global air quality measurements, focusing on PM2.5 concentrations across U.S. states. Data was accessed via API.
- Supplementary data: Measure of America – Socio-economic indicators at the state level, including Human Development Index (HDI), Health Index, Education Index, Income Index, and homelessness statistics. Provided as Excel files.
Methodology
- Data Acquisition
We collected PM2.5 data from OpenAQ using their API. Obtaining individual sensor data required multiple API requests. First, we queried the API to retrieve a list of U.S. locations. Using these locations, we then sent additional requests to obtain sensor-level measurements, filtering to include only sensors that recorded PM2.5 values. After removing missing (NaN) values, we grouped the data by state and selected a random sample of 20 PM2.5 sensors per state. For each sampled sensor, we calculated the most recent yearly average and used these values to estimate a state-level PM2.5 average.
The Measure of America datasets were freely downloadable in Excel format; however, we were required to provide contact information, describe our intended use of the data, and consent to using the data for non-commercial purposes only.
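The OpenAQ sampling step described above can be sketched in pandas as follows. This is a minimal illustration, not the package's actual code: it assumes the API responses have already been collected into one table with one row per PM2.5 sensor, and the column names `state`, `sensor_id`, and `yearly_avg_pm25` are hypothetical.

```python
import pandas as pd


def state_pm25_estimate(sensors: pd.DataFrame, n_per_state: int = 20, seed: int = 0) -> pd.DataFrame:
    """Estimate a state-level PM2.5 mean from a random sample of sensors.

    Assumes one row per sensor with the hypothetical columns
    `state`, `sensor_id`, and `yearly_avg_pm25`.
    """
    # Drop sensors with no usable yearly average.
    clean = sensors.dropna(subset=["yearly_avg_pm25"])
    # Shuffle all rows, then keep the first n_per_state rows of each state.
    # States with fewer sensors than n_per_state keep every sensor.
    shuffled = clean.sample(frac=1, random_state=seed)
    sampled = shuffled.groupby("state").head(n_per_state)
    # Average the sampled sensors to get one row per state.
    return (
        sampled.groupby("state", as_index=False)["yearly_avg_pm25"]
        .mean()
        .rename(columns={"yearly_avg_pm25": "Avg_PM25"})
    )
```

Fixing `seed` keeps the sample reproducible between runs, which matters because a different draw of 20 sensors can shift a state's estimate.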
- Cleaning pipeline
Data cleaning steps included handling missing values, standardizing column names, and converting data types. Measure of America did not provide a single consolidated Excel file; instead, separate files were used to report different indicators (e.g., education, environmental information, and the Human Development Index). To assemble the required variables, we combined multiple Excel files, resolved multi-level headers, and selected only the relevant fields.
Similarly, the OpenAQ data required merging multiple data frames obtained from separate API requests. These were consolidated into a final data frame containing one row per U.S. state and an estimated state-level PM2.5 value, which was then joined with the socio-economic data.
The main tools we used throughout this cleaning process were the Python libraries pandas and requests.
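Two of the cleaning steps described above, resolving multi-level headers and joining the PM2.5 estimates with the socio-economic data, might look roughly like this in pandas. The function names and the `state` join key are assumptions made for illustration; the actual spreadsheets may need different handling.

```python
import pandas as pd


def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse a multi-level column header into single-level names."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        flat = []
        for col in out.columns:
            # Skip blank levels and the "Unnamed: ..." placeholders that
            # pandas generates for empty header cells in Excel files.
            parts = [
                str(p).strip()
                for p in col
                if str(p).strip() and not str(p).startswith("Unnamed")
            ]
            flat.append("_".join(parts))
        out.columns = flat
    # Standardize names: trim whitespace and replace spaces with underscores.
    out.columns = [str(c).strip().replace(" ", "_") for c in out.columns]
    return out


def join_on_state(pm25: pd.DataFrame, indicators: pd.DataFrame) -> pd.DataFrame:
    """Join one-row-per-state tables; raises if a state appears twice."""
    return pm25.merge(indicators, on="state", how="inner", validate="one_to_one")
```

Passing `validate="one_to_one"` makes the merge fail loudly if either table contains a duplicated state, which is an easy mistake to make when combining several files.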
Results & Diagnostics
Using our package, we performed several analyses on our data. The analyses and their results are outlined in the following sections.
Comparing Regions: Northern vs Southern States
We used a two-sample t-test to compare Northern and Southern states. The only variable showing a statistically significant regional difference (p < 0.05) is Health_Index. The other variables (air pollution, education, income, and homelessness) do not differ significantly between Northern and Southern states.
| Variable | t-statistic | p-value |
|---|---|---|
| Avg_PM25 | -0.6254 | 0.5377 |
| Health_Index | 4.6735 | 0.0001 |
| Education_Index | 1.6863 | 0.1055 |
| Income_Index | 0.8212 | 0.4191 |
| Homeless_Ratio | -0.0878 | 0.9310 |
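For readers who want to see what sits behind a t-statistic like those in the table, here is a small standard-library sketch. The report does not say which variant of the t-test was used, so this example assumes Welch's unequal-variance form; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each group mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Approximate degrees of freedom for unequal variances.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

The two-sided p-value then comes from the t distribution with `df` degrees of freedom; `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the statistic and p-value in one call.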
Comparing Regions: Eastern vs Western States
Comparing Eastern and Western states, there is a significant difference (p < 0.05) in both the Education and Health indices, but not in average PM2.5, income, or homelessness.
| Variable | t-statistic | p-value |
|---|---|---|
| Avg_PM25 | 0.0927 | 0.9268 |
| Health_Index | -2.3041 | 0.0257 |
| Education_Index | 2.2673 | 0.0279 |
| Income_Index | 1.4959 | 0.1413 |
| Homeless_Ratio | -0.7737 | 0.4428 |
Best OLS Model for Average PM2.5
Predictors: Not Graduated, HDI, Homeless Ratio
The best Ordinary Least Squares (OLS) model for predicting average PM2.5 used three state-level predictors: the percentage of people aged 25 and over who did not graduate from high school, the Human Development Index (HDI), and the homelessness ratio. The model explains only a small share of the variation in PM2.5 between states (R² = 0.135), and the overall F-test is not significant at the 5% level (p = 0.0755). Two predictors were individually significant: states with a higher share of non-graduates tended to have higher PM2.5 (p = 0.044), while states with higher homelessness had slightly lower PM2.5 (p = 0.042). HDI had a positive coefficient but was not statistically significant (p = 0.072). The predictors may also be correlated with one another (multicollinearity), which makes the individual coefficient estimates less certain. Finally, the Omnibus and Jarque-Bera tests indicate that the residuals are not normally distributed, so the results should be interpreted with caution.
Model Fit Statistics
| Metric | Value |
|---|---|
| Number of observations | 51 |
| R-squared | 0.135 |
| Adjusted R-squared | 0.080 |
| F-statistic | 2.446 |
| Prob (F-statistic) | 0.0755 |
| Log-Likelihood | -138.30 |
| AIC | 284.6 |
| BIC | 292.3 |
| Durbin–Watson | 1.922 |
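As a sanity check, several entries in the table above can be reproduced from R², the log-likelihood, and the sample size alone. The sketch below assumes the table follows the common convention of counting the intercept and slopes (but not the error variance) as parameters in AIC and BIC.

```python
from math import log

n, k = 51, 3                  # observations and predictors (excluding intercept)
r2, loglik = 0.135, -138.30   # values as reported in the table

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F-statistic: explained variance per predictor over residual variance.
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

# Information criteria, with k slopes plus one intercept as parameters
# (assumed convention).
n_params = k + 1
aic = 2 * n_params - 2 * loglik
bic = n_params * log(n) - 2 * loglik

print(round(adj_r2, 3), round(f_stat, 2), round(aic, 1), round(bic, 1))
```

The results agree with the reported values; the F-statistic differs slightly in the third decimal because R² is only given to three digits.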
Regression Coefficients
| Variable | Coefficient | Std. Error | t-statistic | p-value | 95% CI |
|---|---|---|---|---|---|
| Intercept | -4.7427 | 6.893 | -0.688 | 0.495 | [-18.609, 9.124] |
| Not_graduated | 0.4467 | 0.216 | 2.065 | 0.044 | [0.012, 0.882] |
| HDI | 2.0516 | 1.115 | 1.840 | 0.072 | [-0.191, 4.295] |
| Homeless_Ratio | -0.0818 | 0.039 | -2.086 | 0.042 | [-0.161, -0.003] |
Model Diagnostics
| Test | Statistic | p-value |
|---|---|---|
| Omnibus | 37.476 | 0.000 |
| Jarque–Bera | 108.202 | 3.19e-24 |
| Skewness | 2.040 | — |
| Kurtosis | 8.855 | — |
| Condition Number | 337 | — |
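The Jarque-Bera statistic in the table can be recovered from the reported skewness and kurtosis, which shows why the normality test rejects so strongly. This sketch assumes the reported kurtosis is the raw (non-excess) value, so a normal distribution would have kurtosis 3.

```python
from math import exp

n = 51
skew, kurt = 2.040, 8.855   # reported residual skewness and kurtosis

# Jarque-Bera combines squared skewness and squared excess kurtosis.
jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

# Under normality JB is approximately chi-square with 2 degrees of freedom,
# whose survival function is exp(-x / 2).
p_value = exp(-jb / 2)
```

Both terms contribute: the residuals are right-skewed and heavy-tailed, a pattern that one or two unusually high state values would be enough to produce.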
Discussion & Next Steps
Our analysis aimed to identify key state-level factors associated with average PM2.5 levels across the U.S. using an Ordinary Least Squares (OLS) regression model. The overall model only explained a small portion (13.5%) of the differences in PM2.5 between states, indicating that most of the variation is driven by other factors not included in this analysis, such as local industry, geography, or specific air quality regulations.
The choropleth map on our Streamlit app shows the geographical spread of PM2.5 values, highlighting state-to-state variation. Notably, the map shows that Virginia has a high PM2.5 value, which appears to be an outlier that could be influencing the regression results.
The correlation matrix reveals some interesting relationships. Unfortunately, the average PM2.5 value for each state only has a weak positive correlation with the percentage of adults who did not graduate high school, and an almost nonexistent correlation with the other variables, including the Health, Education, and Income indices.
As expected, the Human Development Index (HDI) is highly correlated with its constituent parts: the Health, Education, and Income indices. There are, however, two surprising observations. First, there is a moderate positive correlation (+0.53) between the Homeless Ratio and the HDI, which seems counter-intuitive and may be worth further investigation. Second, the correlation matrix shows only a weak linear relationship between the percentage of electricity generated by coal and natural gas and average PM2.5 levels, where we had expected a stronger one.
Limitations
Multicollinearity: Both the OLS model and the correlation matrix point to multicollinearity, meaning that several predictor variables are highly correlated with each other. This makes it difficult to isolate the unique effect of any single predictor.
Outlier Influence: The presence of a likely outlier (one of Virginia’s sensors) may be distorting the fit of the model and the coefficient values.
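One way to quantify the multicollinearity concern is the variance inflation factor (VIF), which was not part of the original analysis and is offered here only as a possible diagnostic. The sketch uses NumPy; `X` is assumed to be an array with one row per state and one column per predictor.

```python
import numpy as np


def vif(X) -> np.ndarray:
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    out = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        y = X[:, j]
        # Regress predictor j on the remaining predictors plus an intercept.
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        centered = y - y.mean()
        r2 = 1 - (resid @ resid) / (centered @ centered)
        # VIF grows without bound as predictor j becomes predictable from the rest.
        out[j] = 1 / (1 - r2)
    return out
```

A VIF near 1 means a predictor is nearly independent of the others, while large values mean its coefficient estimate is being inflated by overlap with them.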
Future Experiments and Open Questions
To build a more robust predictive model, future analysis should consider the following steps:
- Address Outliers: Rerun the model after removing the outlier sensor in Virginia to check whether the coefficients and overall fit change materially.
- Explore Different Predictors: Given the model’s low R², additional variables should be investigated, such as state-level air quality regulations, population density, or local industry composition (e.g., manufacturing vs. service).
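The outlier check proposed above could start from something like the following NumPy sketch. It is a coarser stand-in for the proposed experiment: instead of recomputing one state's average without a single sensor, it drops one whole observation and reports how much each coefficient moves. The function names are hypothetical.

```python
import numpy as np


def ols_coefs(X, y) -> np.ndarray:
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def coef_shift_without(X, y, drop_idx: int) -> np.ndarray:
    """Change in each coefficient when observation `drop_idx` is removed."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.ones(len(y), dtype=bool)
    keep[drop_idx] = False
    return ols_coefs(X[keep], y[keep]) - ols_coefs(X, y)
```

Large shifts relative to the reported standard errors would suggest that the flagged observation is driving the fit.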