Need for Speed: Ramping up the Velocity of Big Data
Big Data tools such as Grok and IBM Watson are enabling large organizations to behave more like agile startups
Of the transformative technology developments that have ushered in the current frenzy of activity along the information superhighway, the 1994 invention of the “Wiki” by Ward Cunningham is among the most disruptive. Over the past 20 years, the idea that average users can contribute information to a Web site and have that information displayed immediately and shared with other users has evolved into the basis for our social networks and has spilled over to the Internet of Things. Cunningham chose WikiWikiWeb for the name of his application as a reference to the Wiki Wiki Shuttle, the fare-free shuttle bus system at the Honolulu International Airport. “Wiki” is Hawaiian for “quick,” with its doubled form, “Wiki-Wiki,” meaning “very quick.”
This design concept of “quick updates” transformed the Internet from the metaphor of an electronic document library to that of a continuously growing, collaborative, rapid-response electronic research library. Along the way, search and information retrieval tools have become increasingly valuable to a global data repository that is growing exponentially. Recent statistics from YouTube report that users are uploading 100 hours of video each minute and that over 400 years of video is automatically analyzed for copyright ownership each day. A 2012 report sponsored by EMC, owner of VMware, RSA Security and Iomega, predicts that by the year 2020 the collective global size of stored data will exceed 40 trillion gigabytes, over 5,000 gigabytes for each person on Earth.
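The scale of these figures is easier to appreciate with a quick calculation. The short Python sketch below converts the quoted upload rate into years of video per day and divides the projected 40 trillion gigabytes across the world's population. The 2020 population of 7.6 billion and the function names are illustrative assumptions, not values taken from the report.

```python
# Back-of-the-envelope check of the figures quoted in the text.
HOURS_PER_YEAR = 24 * 365

# Assumption for illustration only: approximate 2020 world population.
ASSUMED_2020_POPULATION = 7.6e9


def upload_years_per_day(hours_per_minute: float) -> float:
    """Years of video uploaded per calendar day at a given upload rate."""
    hours_per_day = hours_per_minute * 60 * 24
    return hours_per_day / HOURS_PER_YEAR


def gigabytes_per_person(total_gb: float, population: float) -> float:
    """Average stored data per person, in gigabytes."""
    return total_gb / population


uploaded = upload_years_per_day(100)
per_person = gigabytes_per_person(40e12, ASSUMED_2020_POPULATION)
print(f"{uploaded:.1f} years of video uploaded per day")
print(f"{per_person:,.0f} GB per person")
```

At 100 hours per minute, uploads alone come to roughly 16 years of video per day, so the 400-year figure presumably covers more than that day's new uploads. Under the assumed population, the storage projection works out to a little over 5,000 gigabytes per person.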
In the 1950s, when information was stored on paper pages, cards and tape, the technical challenge was to find ways to grow the size of our data collections. Given our increasing ability to store more information in smaller spaces, the value of collected data is no longer based solely on the size of the collection, but also on how fast we can retrieve useful information from it. This need for speed is reflected in the emergence of Agile Software Development methods around the turn of the century. Ward Cunningham, along with colleagues Kent Beck, Martin Fowler and others, signed the 2001 Manifesto for Agile Software Development, a set of software development ideals that prized collaboration, interaction and rapid response to changing needs and environments over traditional predict-and-plan methods, which spent a great amount of time developing rules and regulations to guide the creation of working software at a date years in the future.
Just as software teams adopted agile methods that respond rapidly to changing requirements and emerging technologies, big data, together with the analysis methods of its knowledge discovery and data mining (KDD) process, is shifting to the use of automated tools that analyze real-time data as it is collected. Over the past decade, Jeff Hawkins, the founder of Palm Computing and Handspring, has been exploring his framework of Hierarchical Temporal Memory, an algorithm that is based on the neuroscience of how human intelligence functions within our brains. This research has resulted in the creation of an open source project named the “Numenta Platform for Intelligent Computing (NuPIC)” and an initial commercial service called “Grok.”
Leveraging complex neuroscience and algorithm design, Grok autonomously develops analysis models of streaming data on the fly. The software constructs hundreds of candidate models for each data stream, from which it selects the most appropriate. Grok can perform this function for thousands of simultaneous data streams in minutes, about the same time it takes a human scientist to open the tools they would use to begin the analysis of a single stream. Grok continues to optimize its models as the data drifts and changes due to fluctuations within processes and their environments. The software detects anomalous behavior in the data streams and alerts human operators as to where they should focus their attention and problem-solving skills to prevent system failures.
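Grok's actual models are built on Hierarchical Temporal Memory, which is well beyond the scope of a short listing. The general pattern described above can still be sketched with much simpler stand-ins: maintain several candidate models per stream, keep using whichever currently predicts best, and flag readings that the chosen model did not expect. In the Python sketch below, the candidate models are exponentially weighted moving averages with different smoothing factors. The class names, parameters and thresholds are all hypothetical and are not part of NuPIC or Grok.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional


@dataclass
class EwmaModel:
    """Stand-in candidate model: an exponentially weighted moving average."""
    alpha: float                   # smoothing factor, 0 < alpha <= 1
    level: Optional[float] = None  # current one-step-ahead prediction
    score: float = 0.0             # discounted sum of squared errors

    def update(self, x: float, decay: float) -> None:
        if self.level is None:
            self.level = x         # first reading only initializes the model
            return
        error = x - self.level
        self.score = decay * self.score + error ** 2
        self.level += self.alpha * error


class StreamMonitor:
    """Tracks one data stream with several candidate models (hypothetical)."""

    def __init__(self, alphas=(0.1, 0.3, 0.6, 0.9), threshold=4.0,
                 warmup=10, decay=0.95):
        self.models = [EwmaModel(a) for a in alphas]
        self.threshold = threshold  # flag residuals beyond this many std devs
        self.warmup = warmup        # residuals to collect before flagging
        self.decay = decay          # below 1 so old errors fade as data drifts
        self.n = 0                  # residual count (Welford's algorithm)
        self.mean = 0.0
        self.m2 = 0.0

    def best_model(self) -> EwmaModel:
        """Candidate with the lowest recent prediction error."""
        return min(self.models, key=lambda m: m.score)

    def observe(self, x: float) -> bool:
        """Feed one reading; return True if it looks anomalous."""
        prediction = self.best_model().level
        anomalous = False
        if prediction is not None:
            residual = x - prediction
            if self.n >= max(self.warmup, 2):
                std = sqrt(self.m2 / (self.n - 1))
                deviation = abs(residual - self.mean)
                anomalous = std > 0 and deviation > self.threshold * std
            # Update the running mean and variance of the residuals.
            self.n += 1
            delta = residual - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (residual - self.mean)
        for model in self.models:
            model.update(x, self.decay)
        return anomalous


monitor = StreamMonitor()
stream = [10.0, 10.2] * 25 + [15.0]  # steady oscillation, then a spike
flags = [monitor.observe(x) for x in stream]
print("anomalies at positions:", [i for i, f in enumerate(flags) if f])
```

Discounting old errors (the `decay` parameter) is one simple way to let the preferred model change as the data drifts. Monitoring thousands of streams would amount to keeping one such monitor per stream.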
During the same decade, the International Business Machines Corporation (IBM) launched an internal “Grand Challenge” project, known as IBM Watson. Following the success of IBM Deep Blue in the area of chess and IBM Blue Gene and its utility in mapping the human genome, IBM Watson was set to the task of answering questions posed using natural language. IBM Watson’s ability to provide answers to written questions was showcased in the February 2011 IBM Jeopardy! Challenge, in which IBM Watson competed against and defeated two Jeopardy! Grand Champions, Ken Jennings and Brad Rutter. For each Jeopardy! clue, IBM Watson parsed the text and rapidly examined 200 million pages of information residing in four terabytes of storage to formulate candidate answers and their associated confidence levels in two to three seconds. Similar to Grok’s ability to construct and consider candidate models of data streams, IBM Watson quickly formulated candidate answers and examined its supply of reference information to gather supporting information for each possible answer. While access to four terabytes of information gathered from encyclopedias, newswires, literary works and the full text of Wikipedia is impressive, IBM Watson’s ability to analyze this information and provide meaningful answers in under three seconds is astonishing.
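IBM Watson's actual pipeline is far more elaborate than anything that fits here, but the basic shape described above (propose candidate answers, gather supporting passages for each, and attach a confidence level) can be illustrated with a toy. The Python sketch below scores each candidate by how many clue words appear in its supporting passages and then normalizes the scores into confidences. The scoring rule, the function names, the clue and the passages are all made up for illustration and do not reflect how IBM Watson works internally.

```python
from math import exp


def tokenize(text: str) -> set:
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(".,!?\"'").lower() for w in text.split())
    return {w for w in words if w}


def rank_candidates(clue: str, evidence: dict) -> list:
    """Return (candidate, confidence) pairs, most confident first.

    `evidence` maps each candidate answer to its supporting passages.
    """
    if not evidence:
        return []
    clue_terms = tokenize(clue)
    # Raw score: number of clue terms found in each supporting passage.
    scores = {
        candidate: sum(len(clue_terms & tokenize(p)) for p in passages)
        for candidate, passages in evidence.items()
    }
    # A softmax turns raw scores into confidences that sum to 1.
    top = max(scores.values())
    weights = {c: exp(s - top) for c, s in scores.items()}
    total = sum(weights.values())
    ranked = [(c, w / total) for c, w in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical clue and evidence, invented for this example.
clue = "This Hawaiian word for quick names a collaborative website"
evidence = {
    "wiki": ["Wiki is the Hawaiian word for quick.",
             "A wiki is a collaborative website."],
    "aloha": ["Aloha is a Hawaiian word used as a greeting."],
}
ranked = rank_candidates(clue, evidence)
for candidate, confidence in ranked:
    print(f"{candidate}: {confidence:.3f}")
```

A contestant built this way would buzz in only when the top confidence clears some threshold, which is the role the confidence levels play in the description above.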
On January 9, 2014, IBM announced the creation of the IBM Watson Group, a business unit that follows in the footsteps of storied IBM business units formed around the inventions of the mainframe and the personal computer. Initial commercial adopters of IBM Watson include businesses in the areas of healthcare, medical research and finance. In addition to learning from documents gathered from the Internet, individual companies can upload their collective organizational knowledge base into IBM Watson and ask it to provide answers based on their unique expertise. While maintaining a long view of general trends and future projections remains important, big data tools such as Grok and IBM Watson are enabling large organizations to behave more like agile startups when it comes to quick decision making by supplying quick answers about individual processes, patients and customers.
• IDC IVIEW: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf
• Manifesto for Agile Software Development, http://agilemanifesto.org
• NuPIC, numenta.org
• Grok, www.groksolutions.com
• IBM Watson, www-03.ibm.com/innovation/us/watson
William Weaver is an associate professor in the Department of Integrated Science, Business, and Technology at La Salle University. He may be contacted at editor@ScientificComputing.com.