
Publication of a scientific article by lecturer Dr. Mayas Mohammed Mahdi, rapporteur of the Computer Department, titled "Big Data: Challenges and Future Research Directions". News date: 31/05/2023 | Views: 201


The big data movement is creating opportunities for the chemical process industries to improve their operations. Challenges, however, lie ahead.

The big data movement is gaining momentum, with companies increasingly receptive to engaging in big data projects. Their expectations are that, with massive data and distributed computing, they will be able to answer all of their questions — from questions related to plant operations to those on market demand. With answers in hand, companies hope to pave new and innovative paths toward process improvements and economic growth.

An article in Wired magazine, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” (1), describes a new era in which abundant data and mathematics will replace theory. Massive data is making the hypothesize-model-test approach to science obsolete, the article states. In the past, scientists had to rely on sample testing and statistical analysis to understand a process. Today, computer scientists have access to the entire population and therefore do not need statistical tools or theoretical models. Why is theory needed if the entire “real thing” is now within reach?

Although big data is at the center of many success stories, unexpected failures can occur when a blind trust is placed in the sheer amount of data available — highlighting the importance of theory and fundamental understanding.

A classic example of such failures is actually quite dated. In 1936, the renowned magazine Literary Digest conducted an extensive survey before the presidential election between Franklin D. Roosevelt and Alfred Landon, who was then governor of Kansas. The magazine sent out 10 million postcards — considered a massive amount of data at that time — to gain insight into the voting tendencies of the populace. The Digest collected data from 2.4 million voters, and after triple-checking and verifying the data, forecast a Landon victory over Roosevelt by a margin of 57% to 43%. The final result, however, was a landslide victory by Roosevelt of 61% versus Landon’s 37% (the remaining votes were for a third candidate). Based on a much smaller sample of approximately 3,000 interviews, George Gallup correctly predicted a clear victory for Roosevelt.

Literary Digest learned the hard way that, when it comes to data, size is not the only thing that matters. Statistical theory shows that sample size affects sample error, and the error was indeed much lower in the Digest poll. But sample bias must also be considered — and this is especially critical in election polls. (The Digest sample was taken from lists of automobile registrations and telephone directories, creating a strong selection bias toward middle- and upper-class voters.)
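The size-versus-bias point can be seen in a toy simulation. All of the proportions below are invented for illustration (they are not the real 1936 figures): voters reachable through the Digest's sampling frame are assumed to lean Landon, while the rest lean Roosevelt. A huge sample drawn only from the biased frame misses the true share badly, while a small simple random sample lands close.

```python
import random

random.seed(0)

# Hypothetical population: 40% of voters are in the Digest-style sampling
# frame (car/telephone owners) and lean Landon; the other 60% lean Roosevelt.
# All probabilities are illustrative assumptions, not historical data.
N = 1_000_000
population = []
for _ in range(N):
    in_frame = random.random() < 0.40            # reachable by the pollster
    p_roosevelt = 0.35 if in_frame else 0.75     # assumed voting tendency
    population.append((in_frame, random.random() < p_roosevelt))

true_share = sum(v for _, v in population) / N

# Huge but biased sample: drawn only from the sampling frame.
frame = [v for f, v in population if f]
biased_share = sum(random.sample(frame, 100_000)) / 100_000

# Small but unbiased simple random sample (Gallup-style, ~3,000 voters).
srs = random.sample([v for _, v in population], 3_000)
srs_share = sum(srs) / len(srs)

print(f"true Roosevelt share:  {true_share:.3f}")
print(f"biased sample (100k):  {biased_share:.3f}")
print(f"random sample (3k):    {srs_share:.3f}")
```

The biased estimate sits near the frame's own tendency no matter how many postcards are counted; only sampling the whole population, even thinly, recovers the true share.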

Another example that demonstrates the danger of putting excessive confidence in the analysis of big data sets involves the mathematical models for predicting loan defaults developed by Lehman Brothers. Based on a very large database of historical data on past defaults, Lehman Brothers developed, and tested for several years, models for forecasting the probability of companies defaulting on their loans. Yet those models, built over such an extensive database, were not able to predict the largest bankruptcy in history — Lehman Brothers’ own.

These cases illustrate two common flaws that undermine big data analysis:

- the sample, no matter how big, may not accurately reflect the actual target population or process;
- the population or process evolves in time (i.e., it is nonstationary), so data collected over the years may not accurately reflect the current situation to which analytics are applied.
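The second flaw, nonstationarity, can be sketched just as simply. The numbers below are invented for illustration, and the "model" is only a historical average rather than anything Lehman Brothers actually used: a predictor fitted to a long, stable history fails the moment the underlying process shifts.

```python
import random

random.seed(1)

# Hypothetical lending data: each observation is 1 if a loan defaulted.
# Historical (stable) regime with a low default probability — assumed value.
history = [1 if random.random() < 0.03 else 0 for _ in range(50_000)]
model_rate = sum(history) / len(history)   # "model" = historical average

# Regime change: the process is nonstationary and the true rate jumps.
current = [1 if random.random() < 0.20 else 0 for _ in range(10_000)]
actual_rate = sum(current) / len(current)

print(f"predicted default rate: {model_rate:.3f}")
print(f"actual default rate:    {actual_rate:.3f}")
```

No amount of additional historical data fixes the prediction here; the gap comes from the process changing, not from the sample being too small.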

These two cases, and other well-known blunders, show that domain knowledge is still needed to handle real problems even when massive data are available. Industrial big data can benefit from past experiences, but challenges lie ahead.