Attempt at a Better MERRA2 Reanalysis

Jun 27, 2025

This is a WIP post that I’m just going to write as a markdown file. I’m pretty sure there is no one subscribed to the RSS so it should be fine.

At work I’m helping a graduate student try to predict the air quality impact of wildfires. To do this we need a way of assessing how good our predictions are against some actual source of air quality data. The obvious answer is to just use ground sensor data.

While this sensor data does exist and is of high quality (I think?), it’s quite sparse and does not cover many fires. To deal with that we wanted to take some satellite readings of air quality and use them as a target for any machine learning we do.

But! they all kind of suck.

Most satellite estimations of air quality rely on (fill in later when I understand more about how this works; it has something to do with diffraction of light into the atmosphere)

They don’t really match sensor readings, especially when air quality is bad, which makes them pretty bad for our purposes.
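To make "especially when air quality is bad" concrete, here is a minimal sketch of how the mismatch could be measured separately for clean and smoky conditions. The function name, the bin edges, and all of the numbers are made up for illustration; none of them come from real sensor or satellite data.

```python
from math import sqrt

def rmse_by_bin(sensor, satellite, edges):
    """RMSE of satellite vs. sensor values, grouped by sensor-value bin.

    Bin i covers edges[i] <= sensor value < edges[i + 1].
    Bins with no readings come back as None.
    """
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for obs, est in zip(sensor, satellite):
        for i in range(len(edges) - 1):
            if edges[i] <= obs < edges[i + 1]:
                sums[i] += (est - obs) ** 2
                counts[i] += 1
                break
    return [sqrt(s / c) if c else None for s, c in zip(sums, counts)]

# Made-up numbers purely for illustration (not real air quality data).
sensor = [5.0, 8.0, 40.0, 60.0, 150.0, 200.0]
satellite = [6.0, 7.0, 35.0, 50.0, 90.0, 110.0]
edges = [0.0, 12.0, 100.0, 1000.0]  # hypothetical good / moderate / bad cut-offs

result = rmse_by_bin(sensor, satellite, edges)
```

If the error grows from the first bin to the last, that is the pattern described above: the estimate is fine on clean days and falls apart on smoky ones.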

Some other people have tried doing a “reanalysis” on the satellite data, which is basically just training a neural net or some other ML model to take the satellite data and nudge it closer to sensor readings.
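As a toy version of that "nudge it closer" idea, here is a sketch that fits a straight line from satellite values to co-located sensor values. This is a deliberately simple stand-in (plain least squares with one input), not what the existing reanalysis products do, and the function names and data are invented for the example.

```python
def fit_linear_correction(satellite, sensor):
    """Least-squares fit of sensor ~ slope * satellite + intercept."""
    n = len(satellite)
    mean_x = sum(satellite) / n
    mean_y = sum(sensor) / n
    sxx = sum((x - mean_x) ** 2 for x in satellite)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(satellite, sensor))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def apply_correction(satellite, slope, intercept):
    """Nudge raw satellite values toward what a sensor would have read."""
    return [slope * x + intercept for x in satellite]

# Made-up co-located pairs where the satellite value reads low.
satellite = [10.0, 20.0, 30.0, 40.0]
sensor = [25.0, 45.0, 65.0, 85.0]  # constructed as 2 * satellite + 5

slope, intercept = fit_linear_correction(satellite, sensor)
corrected = apply_correction(satellite, slope, intercept)
```

A real version would swap the line for a more flexible model and feed it more inputs than the satellite value alone, but the shape of the problem is the same: learn a mapping where sensors exist, then apply it everywhere.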

But again! they all kind of suck.

I’m sure lots of work has been put into these, and it’s probably just a problem that lacks enough data to get something really nice. The amount of work it would take to get something good probably isn’t even worth it in the first place.

However, there seems to be some low-hanging fruit in terms of just throwing more data at the problem. The existing reanalysis attempts have used only a year of data even though something like 20 years are available, and they suffer a bit from overfitting to the training data.

So I’d like to try all of that and maybe use some better practices for avoiding overfitting.
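One practice I have in mind (an assumption on my part, I haven’t settled on anything yet) is to split the data by station instead of by row, so the model is always scored on stations it never saw during training. A minimal sketch with hypothetical station IDs and readings:

```python
def station_folds(station_ids, n_folds):
    """Assign each row to a fold so that a station never spans two folds."""
    unique = sorted(set(station_ids))
    fold_of = {sid: i % n_folds for i, sid in enumerate(unique)}
    return [fold_of[sid] for sid in station_ids]

def split(rows, folds, held_out):
    """Rows in the held-out fold become the test set, the rest train."""
    train = [r for r, f in zip(rows, folds) if f != held_out]
    test = [r for r, f in zip(rows, folds) if f == held_out]
    return train, test

# Hypothetical rows: (station_id, day, reading)
rows = [
    ("A", 1, 7.0), ("A", 2, 8.0),
    ("B", 1, 30.0), ("B", 2, 35.0),
    ("C", 1, 12.0), ("C", 2, 11.0),
]

folds = station_folds([r[0] for r in rows], n_folds=3)
train, test = split(rows, folds, held_out=0)
```

With a random row-level split, consecutive days from the same station land on both sides, and the test score mostly measures how well the model memorized that station.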

Now here are all the problems with getting a good result that I can think of for now.

Even if you do the best possible job with the reanalysis, avoid autocorrelation, use the maximum amount of data, and give the model a bunch of extra info about the reading site, I think you still might bias heavily toward locations where humans place air quality sensors. Maybe you could do some clustering to find sensors that are unusual and see if the model generalizes to them, and then check how unusual they are within the range of how unusual a site could possibly be and see if it’s still good? I bet it would not do well.
I’ll come back later and fill this in; the above was my main concern.
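A rough sketch of the unusual-sensor check: score each site by how far its features sit from the average site, hold out the strangest ones, and test on those. This uses a plain standardized distance as a stand-in for real clustering, and the site names and features (elevation, population density) are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev

def unusualness(features):
    """Distance of each site from the average site, in standardized units."""
    cols = list(zip(*features))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [
        sqrt(sum(((v - m) / s) ** 2 for v, m, s in zip(row, mus, sds)))
        for row in features
    ]

def hold_out_unusual(site_ids, features, n_held_out):
    """Return (kept, held_out), holding out the most unusual sites."""
    scores = unusualness(features)
    ranked = sorted(zip(scores, site_ids), reverse=True)
    held_out = [sid for _, sid in ranked[:n_held_out]]
    kept = [sid for sid in site_ids if sid not in held_out]
    return kept, held_out

# Hypothetical site features: (elevation in m, population density per km^2)
site_ids = ["urban1", "urban2", "urban3", "remote"]
features = [(100.0, 3000.0), (120.0, 2800.0), (90.0, 3200.0), (2500.0, 5.0)]

kept, held_out = hold_out_unusual(site_ids, features, n_held_out=1)
```

Training on `kept` and scoring on `held_out` gives a pessimistic but more honest read on how the model does away from where sensors usually sit. It still can’t say anything about places that are stranger than the strangest sensor.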
