Conference Review (past record): KSME (Korean Society of Mechanical Engineers) — Fluid Mechanics Competition Gold Award

2024.10.09 ·#conference #KSME #neural-network #ANN #PM2.5 #air-quality #Window-Learning #AIC #VIF #environmental-engineering #thesis #review

Back then I'd left a review on my Naver blog, but to keep a record of the past, I'm refining it and moving it over to a Tistory blog.

As an undergraduate, on November 15, 2019, I submitted a paper to the Fluid Mechanics Competition hosted by the Korean Society of Mechanical Engineers (KSME) and won the Gold Award.

* This research was also the topic of my 4th-year graduation thesis (undergraduate research).

Figure 1. Photo from the KSME conference held in Jeju (name mosaicked)

Research topic: Predicting fine-dust concentration via a neural network

I collected weather and air-quality data as variables that can explain fine dust (PM2.5), then processed and optimized that data. Using the processed data, I built a supervised-learning model for PM2.5 concentration, aiming to predict the concentration 6 hours and 12 hours ahead.

❓ A system that predicts fine-dust concentration already exists — is there really a need to do this?

You might well ask. In short, a system that predicts concentration already exists, and its predictions match reasonably well, above a certain confidence level. But the existing system produces its results by repeating computations tens of thousands of times, which takes a lot of time (runs often last more than a month) and considerable resources (money, computers, etc.).

By contrast, the neural network in this research was trained with minimal computing resources, on about 100,000 data points. In other words, it has the distinct advantage that the amount of computation can be greatly reduced, depending on the situation. The other expected benefit was that effects a model does not explicitly account for are still reflected in the real measured data, so in some cases the network could carry more explanatory power.

Figure 2. PM2.5 concentration prediction 6 hours ahead

To present one result: the actual concentration is shown in red, and the concentration predicted by the AI (neural network) in green. The 2019 version of me judged that the trend was matched reasonably well. Later, however, in 2021, during my second research topic, I found that this result contained a time-lag phenomenon. I tried to correct it through reinforcement learning and was able to mitigate it to some degree.

At the time I thought I was prioritizing a quick trend prediction with at least a certain level of accuracy over a 99%-accurate value. But through the 2021 research I learned that, because of the time-lag phenomenon, the prediction was actually wrong.

📌 [A 2019 resolution — left word-for-word, since it was my thinking at the time]

Going forward, I'm excited about how effective AI will be in (the physical world) such as multiphase flow and classical fluids, and I keep studying 'statistics.' In particular, since more and more fields are becoming based on Bayesian statistics rather than parametric statistics, I recommend studying Bayesian statistics. (I really did study a lot of statistics afterward.)

It was a good experience to have as a mechanical-engineering and environmental-engineering undergraduate, and after graduating, even through grad-school life, I'll aim to do a lot of research building on this experience, ultimately come up with many ideas, and do my best to develop technology.

*Once more, I applaud my 2019 self. I'm now an office worker, but it seems I still haven't let go of that dream.

From here on, the present me (October 9, 2024) picks up the story. I'd like to share the data-processing method from back then.

A. Data Collection

Figure 3. Data collection and sources

Air-quality data such as PM2.5, PM10, NOx, and SOx came from Air Korea, and weather data came from the KMA Open MET Data Portal. Covering 2008 through Q3 2018, the collected data totaled 32 variables (columns) and 94,225 observations (rows).

B. Data Processing

(Step 1) Removal of deficient data

Among the collected data, I deleted variables (columns) where more than 50% of the values were empty (Not Available). I judged that they would contaminate the data during later imputation.
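As a rough illustration of Step 1 (not the original 2019 code), the 50% rule can be sketched with NumPy. The function name, the column names, and the toy values below are made up for the example; only the 50% threshold comes from the text.

```python
import numpy as np

def drop_sparse_columns(data, names, max_missing=0.5):
    """Drop columns whose fraction of NaN values exceeds max_missing."""
    data = np.asarray(data, dtype=float)
    missing_ratio = np.isnan(data).mean(axis=0)  # fraction of NaN per column
    keep = missing_ratio <= max_missing
    kept_names = [name for name, k in zip(names, keep) if k]
    return data[:, keep], kept_names

# Toy data: hypothetical column names and values, for illustration only.
names = ["PM2.5", "Temperature", "Mostly_empty"]
data = np.array([
    [35.0, 12.1, np.nan],
    [40.0, np.nan, np.nan],
    [28.0, 11.5, np.nan],
    [31.0, 10.9, 3.0],
])
clean, kept = drop_sparse_columns(data, names)
print(kept)  # the column that is 75% empty is gone
```

Columns with some gaps (like the toy Temperature column, 25% missing) survive this step and are handled later by imputation.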

(Step 2) Explanatory-variable selection: AIC (Akaike Information Criterion) — Stepwise selection method

Figure 4. Data processing: variable selection

Among the variables (columns) remaining after Step 1, I used AIC-based stepwise selection to remove variables with low explanatory power for the target variable, PM2.5. As a result, two variables, Middle and low cloud and Sunshine hour, were removed, leaving 21 variables. However, I kept in mind that independence between variables was still not guaranteed.
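To make the idea concrete, here is a minimal sketch of AIC-driven selection. It assumes backward elimination on an ordinary least-squares model, because the post does not say which stepwise direction or which tool was used. The AIC is written as n·ln(RSS/n) + 2k, which is the Gaussian-error form with constant terms dropped. All names and the synthetic data are hypothetical.

```python
import numpy as np

def ols_aic(X, y):
    """AIC of an OLS fit with intercept (Gaussian errors, constant dropped)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    k = A.shape[1]  # number of fitted coefficients
    return n * np.log(rss / n) + 2 * k

def backward_stepwise_aic(X, y, names):
    """Repeatedly drop the column whose removal lowers AIC the most."""
    selected = list(range(X.shape[1]))
    best = ols_aic(X[:, selected], y)
    while len(selected) > 1:
        trials = [(ols_aic(X[:, [c for c in selected if c != j]], y), j)
                  for j in selected]
        aic, drop = min(trials)
        if aic >= best:  # no removal improves AIC, so stop
            break
        best = aic
        selected.remove(drop)
    return [names[j] for j in selected], best

# Synthetic example: y depends on x0 and x1; "noise" is unrelated to y.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
names = ["x0", "x1", "noise"]
kept, best_aic = backward_stepwise_aic(X, y, names)
print(kept, round(best_aic, 1))
```

The informative columns always survive, while an unrelated column is usually, though not always, dropped. Note that AIC only scores explanatory power; it says nothing about whether the surviving variables are independent of each other, which is why Step 3 follows.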

(Step 3) Independent-variable selection (Check of Independence): VIF (Variance Inflation Factor)

Figure 5. Data processing: independent-variable selection via VIF

Since independence among the explanatory variables is a basic assumption of the model, I secured it using the Variance Inflation Factor together with physical (mechanistic) reasoning.

The VIF value only provides the statistical result that a variable is highly correlated with the others; deciding which variable to remove takes engineering knowledge. For example, in the case of relative humidity and vapor pressure, the right side of Figure 3 shows that relative humidity is defined through vapor pressure. In such a case, I judged it correct to keep vapor pressure, the more fundamental variable. I selected the independent variables based on judgments like this.
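For reference, the VIF of variable j is 1 / (1 - R_j²), where R_j² comes from regressing variable j on all the other explanatory variables. The sketch below computes it with plain NumPy. The toy data is an assumption made for illustration: relative humidity is generated as vapor pressure divided by a fixed saturation pressure (about 23.4 hPa, roughly the value at 20 °C) plus noise, so the two columns are nearly collinear, and wind speed is independent.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()  # R^2 of column j on the rest
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical toy data, not the measured dataset.
rng = np.random.default_rng(0)
n = 1000
vapor_pressure = rng.uniform(5.0, 20.0, n)                       # hPa
rel_humidity = 100.0 * vapor_pressure / 23.4 + rng.normal(0, 1, n)
wind_speed = rng.uniform(0.0, 10.0, n)                           # m/s
v = vif(np.column_stack([vapor_pressure, rel_humidity, wind_speed]))
print(dict(zip(["vapor_pressure", "rel_humidity", "wind_speed"], v.round(1))))
```

Both humidity-related columns get a very large VIF, well above the commonly quoted rule-of-thumb threshold of about 10, while wind speed stays near 1. The number flags the pair but does not say which one to drop; that choice is the engineering judgment described above.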

(Step 4) Final selected variables (Selected Variables)

List of 13 variables

PM10, Visibility, SO2, CO, NO2, Vapor Pressure, O3, Sea level pressure, Temperature, Solar radiation, Total cloudy, Wind direction, Wind speed

After the three steps above, 13 explanatory variables were selected to explain PM2.5 (fine dust).

(Step 5) Imputation

Figure 6. Imputation result

In Step 1, I only removed variables with more than 50% missing values; NA values still exist in the remaining variables. Therefore, for the final 13 selected variables, I filled in the NA values using the Predictive Mean Matching (PMM) method. The right-side figures (a) and (b) show the trend of the filled-in values. Red dots are imputed values, and green dots are previously measured values.

For reference, I tried imputation using a Deterministic Linear Model and a Stochastic Linear Model, but they produced non-physical values and couldn't be used.
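A simplified sketch of Predictive Mean Matching helps show why it avoids non-physical values. It is an assumption-laden toy version: a single linear regression, complete predictors, and a random draw among the k closest donors. Full implementations also draw the regression coefficients randomly, which is omitted here. The function name and data are hypothetical.

```python
import numpy as np

def pmm_impute(X, y, k=5, seed=0):
    """Fill NaNs in y with observed values from rows with similar predictions."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)
    A = np.column_stack([np.ones(len(y)), X])
    # Fit a linear model on the rows where y was measured.
    beta, *_ = np.linalg.lstsq(A[obs], y[obs], rcond=None)
    pred = A @ beta
    y_obs, pred_obs = y[obs], pred[obs]
    filled = y.copy()
    for i in np.flatnonzero(~obs):
        # Donors: the k observed rows whose predicted mean is closest.
        donors = np.argsort(np.abs(pred_obs - pred[i]))[:k]
        filled[i] = y_obs[rng.choice(donors)]  # copy a real measured value
    return filled

# Toy data: a non-negative quantity with 30 values knocked out.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 10.0, n)
y = np.clip(3.0 * x - 5.0 + rng.normal(0.0, 2.0, n), 0.0, None)
missing_idx = rng.choice(n, 30, replace=False)
y[missing_idx] = np.nan
filled = pmm_impute(x[:, None], y)
print(np.isnan(filled).sum(), filled.min())
```

Because every imputed value is copied from an actual measurement, the result cannot fall outside the observed range (for example, no negative concentrations), whereas a raw linear prediction can.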

Going through the five processes above, I finished the data processing and completed the preparation for model training.

C. Model Training

Figure 7. Training model (NN)

Using the processed data above, I trained the prediction model: a basic ANN with the Window Learning technique.

Figure 8. Batch Configuration

Figure 8 shows how the batches were configured to use the previous 72 hours of data to predict 6, 12, and 24 hours ahead. For example, to predict the single value at the 24-hour mark, the 14 variables (the 13 explanatory variables plus PM2.5) over the previous 72 hours are reshaped into one input sample. Here each batch is set to 100.
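The reshaping described above can be sketched as follows. This is a guess at the layout, not the original code: it assumes each 72-hour window of 14 variables is flattened into one vector for the ANN, that the target is PM2.5 (placed in column 0 here) exactly `horizon` hours after the last hour in the window, and that "100" means 100 samples per batch.

```python
import numpy as np

def make_windows(series, window=72, horizon=6, target_col=0):
    """Turn an hourly (T, n_features) array into flattened windows and targets."""
    series = np.asarray(series, dtype=float)
    T, _ = series.shape
    n_samples = T - window - horizon + 1
    if n_samples <= 0:
        raise ValueError("series too short for this window and horizon")
    # Sample i uses hours i .. i+window-1, flattened to window * n_features.
    X = np.stack([series[i:i + window].ravel() for i in range(n_samples)])
    # Its target is the value `horizon` hours after the last input hour.
    first = window + horizon - 1
    y = series[first:first + n_samples, target_col]
    return X, y

# Toy series: 300 hours, 14 variables; column 0 counts hours to show alignment.
rng = np.random.default_rng(0)
series = rng.normal(size=(300, 14))
series[:, 0] = np.arange(300)

X, y = make_windows(series, window=72, horizon=6)
batches = [(X[s:s + 100], y[s:s + 100]) for s in range(0, len(X), 100)]
print(X.shape, y[0], len(batches))
```

In the toy run the first sample ends at hour 71 and its target is hour 77, six hours later. Changing `horizon` to 12 or 24 gives the other prediction targets without touching the input layout.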

D. Training & Test Result

Figure 9. Training result and test result (scatter plot)

At the time, I checked the degree of training via the cost and validation values. I also presented the test results as a scatter plot. I could see that the higher the concentration being predicted, the more the model's performance degraded. And now that I think about it, a scatter plot doesn't even capture the time-lag phenomenon.

Figure 10. Predicted vs. observed time-series graph

In 2019 I didn't know this result was completely wrong. Looking at the overall picture, especially the 6-hour prediction model, it seemed to match well. But now, because of the time-lag phenomenon mentioned earlier, I can see this result is effectively wrong. Whether the process was good or the presentation was good, I don't know, but with a wrong result I confidently fooled(?) the professors and won. The Gold Award, no less ^^
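One simple way to check for this kind of time lag is sketched below. It is a diagnostic sketch only, not the reinforcement-learning correction from 2021, and it assumes the lag means the predicted curve trails the observed one. The idea is to shift the prediction back by s hours and see which shift correlates best with the observation. The toy "prediction" is just the observation delayed by 6 hours.

```python
import numpy as np

def lag_profile(observed, predicted, max_lag=24):
    """Correlation between observed[t] and predicted[t + s] for s = 0..max_lag."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    corrs = []
    for s in range(max_lag + 1):
        o = observed[:len(observed) - s]
        p = predicted[s:]
        corrs.append(np.corrcoef(o, p)[0, 1])
    return np.array(corrs)

# Synthetic, slowly varying "observation" (AR(1) process).
rng = np.random.default_rng(0)
obs = np.zeros(2000)
for t in range(1, len(obs)):
    obs[t] = 0.95 * obs[t - 1] + rng.normal()

# A fake 6-hour-ahead "model" that only repeats the value from 6 hours ago.
pred = np.concatenate([np.full(6, obs[0]), obs[:-6]])

corrs = lag_profile(obs, pred)
print(int(np.argmax(corrs)), corrs[0].round(2))
```

The unshifted correlation is still fairly high because the series changes slowly, which is why such a forecast can look acceptable in a scatter plot or at a glance in a time-series plot. A best shift equal to the forecast horizon suggests the model is mostly echoing recent values rather than predicting.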

The story afterward: grad school, 2021

Later, in grad school, I corrected the time-lag phenomenon to some degree through reinforcement learning that considered not only Euclidean distance but also angle, and I did see an effect. But because I was badly short on time (busy with my main research), I ultimately couldn't finish it before graduating.

📦 Migrated from the Tistory blog I used to run. Original: taehyuklee.tistory.com/18