Despite some doubts I expressed yesterday, the “Big Data” webcast did indeed happen today (although it appeared that there were no media in the audience except for a representative of Science, and presumably he was there because the webcast was staged at the AAAS auditorium).
That’s the end of my complaining, because I found it to be a pretty good and informative webcast (no word on when/if it will be available for replay). Here are a few takeaways about the consensus that is forming, at least at the federal level:
• An overview: In virtually every field of science and engineering, a large number of data sets are being generated. The key challenges now are 1) being thoughtful about accumulating and storing the data; 2) moving from data to knowledge; and 3) moving from knowledge to action.
• Data is being accumulated in structured, semistructured and unstructured forms.
• The data present enormous challenges for sharing and/or moving (many terabytes of genome data are being moved to cloud storage with the help of Amazon).
• The data present enormous challenges for gleaning knowledge. Where are the software tools and algorithms? Are they scalable? What data is best analyzed by machine learning? Can data be converted for better human interpretation (e.g., advanced visualization techniques)?
• The data present statistical challenges in separating noise from knowledge.
• Yes, NSA and the intelligence community probably are way ahead on this.
• There is already a major shortage of people trained in advanced data management, interpretation and statistical analysis, and that means both the next generation and the current generation of professionals must be trained. (Ironically, it was noted that things come full circle because there are suddenly huge real-time data sets coming out of high-level and free education projects like Khan Academy and Stanford University’s experiments that will revolutionize our understanding of learning and accelerate education on a worldwide scale.)
Many of the speakers used the platform to announce major new initiatives and awards. Here are some of the ones coming out of DOE, NSF, DOD and DOD/DARPA:
SDAV Institute takes aim at improving the nation’s ability to extract knowledge and insights from large and complex collections of digital data. Led by the Energy Department’s Lawrence Berkeley National Laboratory, the SDAV Institute will bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the Department’s supercomputers, which will further streamline the processes that lead to discoveries made by scientists using the Department’s research facilities.
“Big Data” is a new joint solicitation to advance the core scientific and technological means of managing, analyzing, visualizing, and extracting useful information from large and diverse data sets. This will accelerate scientific discovery and lead to new fields of inquiry that would otherwise not be possible. NIH is particularly interested in imaging, molecular, cellular, electrophysiological, chemical, behavioral, epidemiological, clinical, and other data sets related to health and disease.
NSF is also:
- Encouraging research universities to develop interdisciplinary graduate programs to prepare the next generation of data scientists and engineers;
- Funding a $10 million project based at the University of California, Berkeley, that will integrate three powerful approaches for turning data into information: machine learning, cloud computing, and crowd sourcing;
- Providing the first round of grants to support “EarthCube” — a system that will allow geoscientists to access, analyze and share information about our planet;
- Issuing a $2 million award for a research training group to support training for undergraduates to use graphical and visualization techniques for complex data;
- Providing $1.4 million in support for a focused research group of statisticians and biologists to tell us about protein structures and biological pathways;
- Convening researchers across disciplines to determine how Big Data can transform teaching and learning.
Department of Defense: Data to Decisions Program to Provide $250 Million Annually, Including $60 Million for New Research Projects, to:
- Harness and utilize massive data in new ways and bring together sensing, perception and decision support to make truly autonomous systems that can maneuver and make decisions on their own;
- Attain a 100-fold increase in the ability of analysts to extract information from texts in any language;
- Attain a similar increase in the number of objects, activities and events that an analyst can observe; and
- Announce a series of open prize competitions over the next several months to accelerate innovation in Big Data that meets these and other requirements.
DARPA, through its XDATA program, is:
- Investing $25 million annually to develop computational techniques and software tools for analyzing large volumes of data, including semistructured (e.g., tabular, relational, categorical, metadata) and unstructured (e.g., text documents, message traffic);
- Developing scalable algorithms for processing imperfect data in distributed data stores;
- Creating effective human-computer interaction tools for facilitating rapidly customizable visual reasoning for diverse missions; and
- Supporting open source software toolkits to enable flexible software development for users to process large volumes of data in timelines commensurate with mission workflows of targeted defense applications.
There’s a lot of overlap in the projects being announced, so one would expect a lot of interagency sharing. Also, the overall effort strikes me as being underfunded, but one would expect that there are several deep-pocket private sector partners who are willing to put up matching funds or in-kind resources (such as the Amazon example, above).
The White House has also prepared a “Big Data Factsheet” (pdf) outlining all of the current and new work at the federal level that is underway.