KSME Conference Review — Neural-Net PM2.5 Prediction (Gold Award, KSME Fluid Mechanics Competition, 2019)

📅 A retrospective on research I did for my undergraduate thesis in November 2019. Won the Gold Award at the KSME (Korean Society of Mechanical Engineers) Fluid Mechanics Competition. Re-organized in 2024, with an honest note about a pitfall I only spotted two years later in grad school.

Figure 1. KSME 2019 academic conference in Jeju: Fluid Mechanics Competition gold-award ceremony.

What I did

Trained an artificial neural network (ANN) on weather and air-quality data to predict PM2.5 concentrations 6 and 12 hours ahead.

Figure 2. PM2.5 prediction, 6 hours ahead (conference slide).

Why I thought it mattered

The conventional approach to PM2.5 forecasting is numerical modeling (CFD + atmospheric chemistry models), which:
- Requires more than a month of compute and massive resources.
- Is accurate, but expensive.

The neural-network approach trades some accuracy for the following:
- About 100k samples are enough to train.
- Inference is near-instant after training.
- Accuracy per unit of compute is dramatically better.

Goal: a model that trades some accuracy for cheap, fast short-term prediction.

Data pipeline

A. Data collection

Figure 3. Data collection and sources.

B. Data preprocessing — 3-step variable selection

Step 1. Drop sparse variables

Removed variables with ≥50% missing values (to keep training data clean).
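A minimal sketch of this filter, not the original thesis code. It assumes the data is held as a mapping from column name to a list of values, with `None` marking a missing measurement; the function name, column names, and toy values are all made up for illustration.

```python
def drop_sparse_columns(table, max_missing_ratio=0.5):
    """Keep only columns whose share of missing values is below the threshold.

    `table` maps a column name to a list of values; None marks a missing value.
    """
    kept = {}
    for name, values in table.items():
        missing_ratio = sum(v is None for v in values) / len(values)
        # The text removes variables with >= 50% missing, so keep strictly below.
        if missing_ratio < max_missing_ratio:
            kept[name] = values
    return kept


# Toy data, invented for illustration only (not the thesis data).
toy = {
    "temperature": [3.1, 2.8, None, 2.5],    # 25% missing -> kept
    "snow_depth": [None, None, 0.4, None],   # 75% missing -> dropped
    "wind_speed": [1.2, None, None, 0.9],    # 50% missing -> dropped
}
cleaned = drop_sparse_columns(toy)
print(sorted(cleaned))
```

Only `temperature` survives here, because a column sitting exactly at the 50% threshold is also removed.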

Step 2. Stepwise selection by AIC

AIC (Akaike Information Criterion) balances model fit against the number of variables. Lower is better.
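For reference, the textbook definition behind this step is written out below. The symbols are standard notation rather than anything taken from the original slides: k is the number of estimated parameters, L-hat the maximized likelihood, n the number of samples, and RSS the residual sum of squares of a least-squares fit.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad\text{and, for least squares with Gaussian errors,}\qquad
\mathrm{AIC} = n\ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k + \text{const}
```

Dropping a variable lowers the penalty term 2k by 2 but usually raises RSS. A stepwise procedure keeps the drop only when the total AIC goes down, that is, when the fit term grows by less than 2.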

Using stepwise selection:
- Dropped "Middle and low cloud" and "Sunshine hour".
- Left with 21 variables.

Figure 4. Data preprocessing: stepwise AIC variable selection.

Step 3. Independence check with VIF

VIF (Variance Inflation Factor) measures multicollinearity. A high VIF means a predictor has a strong linear relationship with the other predictors, so only one variable of a collinear pair should be kept.
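To make the number concrete: VIF for predictor j is 1 / (1 - R²), where R² comes from regressing predictor j on all the other predictors. The sketch below covers only the two-predictor special case, where that R² equals the squared Pearson correlation. The helper names and every value in the toy data are assumptions for illustration, not the thesis data or code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x, y):
    """VIF = 1 / (1 - R^2); with only two predictors, R^2 equals r^2."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Toy numbers, invented for illustration (not the thesis data).
relative_humidity = [40.0, 55.0, 60.0, 72.0, 85.0, 90.0]
vapor_pressure = [6.1, 8.0, 9.2, 10.9, 12.8, 13.9]
wind_speed = [2.3, 0.8, 3.1, 1.0, 2.6, 1.9]

vif_collinear = vif_two_predictors(relative_humidity, vapor_pressure)
vif_independent = vif_two_predictors(relative_humidity, wind_speed)
print(f"humidity vs vapor pressure: VIF = {vif_collinear:.1f}")
print(f"humidity vs wind speed:     VIF = {vif_independent:.1f}")
```

A VIF close to 1 means the predictor carries its own information; common rules of thumb start to worry somewhere around 5 to 10. With more than two predictors, the R² has to come from a full multiple regression instead of a single correlation.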

Here I combined the statistical results with domain knowledge. For example, relative humidity and vapor pressure both had high VIF, so I kept vapor pressure since it's the more fundamental physical quantity (relative humidity is derived from vapor pressure and temperature).

Figure 5. Data preprocessing: VIF for independent-variable selection.

Result. Final 13 variables

PM10 · Visibility · SO2 · CO · NO2 · Vapor pressure · O3 · Sea level pressure · Temperature · Solar radiation · Total cloud cover · Wind direction · Wind speed

C. Imputation

Used Predictive Mean Matching (PMM). Linear-model-based imputation can produce non-physical values (e.g., negative concentrations), so I rejected it.
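The idea behind PMM can be sketched in a few lines. This is a simplified, single-predictor version written for illustration, not the original thesis code. Fuller implementations also add randomness to the regression coefficients, which is omitted here. The function names, the donor count `k`, and the toy values are all assumptions.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (single predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    return mean_y - slope * mean_x, slope


def pmm_impute(x, y, k=3, seed=0):
    """Fill None entries of y by borrowing observed values from similar cases."""
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([p[0] for p in observed], [p[1] for p in observed])

    def predict(xi):
        return intercept + slope * xi

    filled = []
    for xi, yi in zip(x, y):
        if yi is not None:
            filled.append(yi)
            continue
        # Donors: the k observed cases whose predicted value is closest.
        donors = sorted(observed, key=lambda p: abs(predict(p[0]) - predict(xi)))[:k]
        # Copy a donor's *observed* value, so the result is always a real measurement.
        filled.append(rng.choice(donors)[1])
    return filled


# Toy data, invented for illustration: x = PM10, y = PM2.5 with two gaps.
pm10 = [5.0, 12.0, 20.0, 35.0, 50.0, 2.0, 28.0]
pm25 = [2.0, 6.0, 11.0, 19.0, 30.0, None, None]
imputed = pmm_impute(pm10, pm25)
print(imputed)
```

On this toy data the fitted line dips slightly below zero at PM10 = 2, which is exactly the non-physical value a plain regression fill would write into the gap. PMM uses the regression only to find similar cases and then copies a value that was actually measured.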

Figure 6. PMM imputation results.

Model — ANN + Window Learning

Figure 7. Training model: neural-network architecture.

A basic ANN with Window Learning on top — the standard pattern for time-series forecasting:

Take 72 hours of prior data as one input window → predict PM2.5 at 6 / 12 / 24 hours ahead.
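The windowing itself can be sketched as below. This is an illustrative single-variable version, whereas the real input stacked several variables per hour. The function name `make_windows` and the toy series are assumptions, not the original code.

```python
def make_windows(series, window=72, horizon=6):
    """Pair each `window`-long history with the value `horizon` steps after its end."""
    inputs, targets = [], []
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        end = start + window  # index just past the input window
        inputs.append(series[start:end])
        # The last input hour is end - 1, so `horizon` hours later is end - 1 + horizon.
        targets.append(series[end + horizon - 1])
    return inputs, targets


# Toy hourly series 0, 1, 2, ... so that indices and values coincide.
hourly = list(range(100))
X, y = make_windows(hourly, window=72, horizon=6)
print(len(X), X[0][-1], y[0])
```

With 100 hourly values this gives 23 samples: the first window ends at hour 71 and is paired with the value at hour 77. Changing `horizon` to 12 or 24 gives the longer-range targets.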

Figure 8. Batch configuration: 14 variables restructured, batch size 100.

Results

Figure 9. Train / test scatter plots.

Figure 10. Predicted vs. observed PM2.5 time series.

At the time of the 2019 presentation, the results "looked fine." The R² and MAE from the scatter plots were reasonable, and the time-series plot tracked the overall trend.

Two years later — the pitfall I missed

While revisiting the same model in 2021 (grad school), I noticed something:

Time lag: the model wasn't really predicting the future; it was just copying the most recent observed value. The "6-hour-ahead" prediction matched the observation from 6 hours earlier almost exactly.

Scatter plots hide this entirely. Points cluster well and R² looks great, but if you unfold the time series and align by timestamp, you see that the predicted curve is just the observed curve delayed by N hours, where N is the forecast horizon.
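One way to catch this numerically is to slide the prediction against the observation and see which shift gives the lowest error. The sketch below is an illustration on a synthetic series, not the check I actually ran in 2021. The function names and the toy signal are assumptions.

```python
import math


def mae(a, b):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mae_by_shift(observed, predicted, max_shift=12):
    """MAE between predicted[t] and observed[t - s] for each shift s in 0..max_shift."""
    n = len(observed)
    return {s: mae(predicted[s:], observed[: n - s]) for s in range(max_shift + 1)}


# Toy series, invented for illustration: a wavy "observed" signal and a
# "prediction" that merely repeats the observation from 6 steps earlier.
observed = [20 + 10 * math.sin(t / 5) + 3 * math.sin(t / 1.7) for t in range(120)]
lag = 6
predicted = [observed[max(t - lag, 0)] for t in range(len(observed))]

scores = mae_by_shift(observed, predicted)
best = min(scores, key=scores.get)
print(f"MAE at shift 0: {scores[0]:.2f}, best shift: {best} (MAE {scores[best]:.2f})")
```

A model with real skill should score best at shift 0. A best shift equal to the forecast horizon, with much lower error than at shift 0, is the signature of the lag problem. Comparing against a persistence baseline (predict the last observed value) tells the same story.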

A partial fix (RL with distance + angle)

In grad school I tried again with reinforcement learning:
- Added an angle term (direction of change in time) to the loss, on top of the standard Euclidean distance.
- The lag was reduced somewhat.
- Didn't finish it cleanly, since my main research took priority.
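The exact formulation isn't recorded in this post, so the following is only a guess at the shape of such a loss, not the version I actually trained with, and it leaves out the reinforcement-learning setup entirely. It assumes the "angle" is the slope angle of each one-step change in the (time, value) plane and that the penalty is 1 - cos of the angle difference. The function name and `angle_weight` are hypothetical.

```python
import math


def distance_angle_loss(observed, predicted, angle_weight=1.0):
    """Mean squared error plus a penalty on mismatched step-to-step direction."""
    n = len(observed)
    distance = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n

    angle = 0.0
    for t in range(1, n):
        # Direction of change over one time step, as an angle in the (time, value) plane.
        theta_obs = math.atan2(observed[t] - observed[t - 1], 1.0)
        theta_pred = math.atan2(predicted[t] - predicted[t - 1], 1.0)
        angle += 1.0 - math.cos(theta_pred - theta_obs)
    angle /= n - 1

    return distance + angle_weight * angle


# Toy series, invented for illustration: a zig-zag and a copy delayed by one step.
observed = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0]
lagged = [0.0] + observed[:-1]

perfect_loss = distance_angle_loss(observed, observed)
distance_only = distance_angle_loss(observed, lagged, angle_weight=0.0)
with_angle = distance_angle_loss(observed, lagged, angle_weight=1.0)
print(perfect_loss, distance_only, with_angle)
```

The point of the extra term is that a delayed copy moves in the wrong direction around every turning point, so it pays a penalty there even when its pointwise distance is small.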

Retrospective — then and now

"I won gold by (unknowingly) fooling the judges."

The award was real, but I hadn't seen the underlying flaw. Still, here is what stays:
- Building the entire pipeline (data collection → variable selection → imputation → model → eval) by hand was the real learning.
- The statistics and neural-network fundamentals from this work are still the base I work from.
- Wrong results are also learning: "don't trust scatter plots alone" was the lesson, and it has saved me twice since.

What I wrote in 2019

"I'm excited about how AI will combine with classical fluid mechanics — multiphase flows, traditional fluids. Still studying statistics. Bayesian statistics in particular — highly recommended."

And in 2024

Working in industry now, but the picture I drew back then still holds: environmental time series like PM2.5, multiphase / CFD + ML, and Bayesian methods are all still interesting territory.

📦 Migrated from my own Korean blog (my own writing). Original: taehyuklee.tistory.com/18
