Figure 1: The Big Data Ecosystem

Today, we are more connected than ever. We live in an ‘always-on’ world whose digital economy has made data a new form of resource that fundamentally changes our lives. But has this revolution really occurred across R&D domains? At a time when global R&D investment is over $1.5 trillion and is more externalized than ever, leading voices still bemoan a lack of open access to decision-making data and an innovation deficit syndrome. Getting globally connected is a problem telecommunications and search engines have already addressed. The solution that addresses the real issue is to enable the R&D community to connect and to collaborate through data; to gain insight into challenging problems through improvements to underlying data quality, provenance and availability. In short: “Big Collaboration for Big Science using Big Data.”

The global importance of R&D to the economy
In 2012, global R&D spending grew by $50 billion to $1.5 trillion. On the surface, this is a rosy outlook, but it masks a much darker picture. At the same time as governments and R&D executives are talking up the ‘Innovation Economy’ and the next 10 million U.S. jobs, entrepreneurs such as Peter Thiel, founder of PayPal, claim that “innovation in America is somewhere between dire straits and dead.” The life sciences sector offers a cautionary tale. Despite hundreds of billions of dollars spent on R&D, as of 2011, U.S. life expectancy rested at just 78.7 years, only four years more than in 1980. Is this enough extra ‘life’ for all the money? The spectacular advances in molecular biology and the high-tech horizon of precision medicine have come nowhere close to matching the effects of a basic innovation such as improved sanitation.

Pierre Azoulay of MIT and Benjamin Jones of Northwestern University indicate that one factor in the progressive reduction in productivity of R&D may be the “burden of knowledge.” As new ideas and information accumulate, it takes ever longer for thinkers, such as scientists and engineers, to catch up with the frontier of their scientific or technical specialty. David Shaywitz, a commentator on medical innovation, has also called out poor decision-making as a major contributor to the stifling of new drug development. Given the readiness of technologies in communication and data production, this state of affairs is embarrassing and must be overcome by action.

R&D generates data assets
R&D is tasked with generating and using high-value data assets, and this has traditionally been thought of as a linear process, from idea to product. In fact, data, information and knowledge are created through complex iterative processes that span research, design, development, patent filing, manufacture and post-market. It always has been a collaborative data ecosystem and, over recent years, it has become increasingly globalized, multiparty and multidisciplinary. This is true in many R&D sectors, such as pharma, food and beverage and fast-moving consumer goods (FMCG), whose similarities are well-described by Peter Boogaard.

A thriving knowledge ecosystem functions on high-quality data
Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of users of the system when they are connected with compatible, communicating devices. ‘Zuckerberg’s Law’ claims that every year, for the foreseeable future, the amount of information we share will double. This combination has profound meaning for how we use and further exploit the new global ‘grid’ community to increase R&D productivity.
To understand why Metcalfe’s and Zuckerberg’s Laws apply to R&D, we must understand the data lifecycle and its usage from initial capture to impact on consumers. This lifecycle has not changed in 20 years, but it is now much more apparent given the explosion in the amount and variety of data drowning most R&D organizations. For example, in the health sciences data ecosystem, the generation and use of data are becoming more and more distributed and collaborative in the pursuit of better disease treatment.
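As a rough illustration of the two laws, the sketch below computes a Metcalfe-style network value (proportional to the square of the number of connected users) and a Zuckerberg-style yearly doubling of shared information. The function names, the proportionality constant `k` and the sample inputs are assumptions made for illustration; they are not figures from the sources cited here.

```python
def metcalfe_value(users: int, k: float = 1.0) -> float:
    """Network value proportional to the square of the connected users.

    k is an arbitrary proportionality constant (an assumption).
    """
    if users < 0:
        raise ValueError("users must be non-negative")
    return k * users ** 2


def shared_information(initial: float, years: int) -> float:
    """Amount of shared information after doubling once per year."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return initial * 2 ** years


if __name__ == "__main__":
    # Doubling the number of connected collaborators quadruples the value.
    for n in (10, 20, 40):
        print(n, metcalfe_value(n))
    # Five years of yearly doubling from one unit of information.
    print(shared_information(1.0, 5))
```

The point of the comparison is that value grows with the square of connected participants only if they can actually exchange compatible data, which is why the quality and context of that data matter.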

Capture with context
As with any language, effective communication relies upon context: the metadata that enables you to understand whether you are comparing apples with apples or apples with oranges is critically important. Across all sectors, within every R&D process, data is generated in ways that carry vital contextual information for the data consumer. This context could be as simple as temperature variation or as complex as a genome, but capturing instrument, sample preparation methods, analysis parameters or observational data with high context is essential to R&D. Even a ‘simple’ measurement is not meaningful until the experimental and analysis conditions are specified and the subjective observations and conclusions are associated with the data.
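One way to picture “capture with context” is a measurement record that refuses to be compared unless its context is present and matches. This is a minimal sketch: the `Measurement` class, the `REQUIRED_CONTEXT` keys and the sample values are all hypothetical and chosen only to mirror the kinds of context named above (instrument, sample preparation, analysis parameters).

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical minimum context needed before two values can be compared.
REQUIRED_CONTEXT = ("instrument", "sample_prep", "analysis_method")


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str
    context: Dict[str, str] = field(default_factory=dict)


def missing_context(m: Measurement) -> List[str]:
    """Return the required context keys that were not captured."""
    return [key for key in REQUIRED_CONTEXT if key not in m.context]


def comparable(a: Measurement, b: Measurement) -> bool:
    """True only if both records carry full, matching context and units."""
    if missing_context(a) or missing_context(b):
        return False
    if a.unit != b.unit:
        return False
    return all(a.context[k] == b.context[k] for k in REQUIRED_CONTEXT)


if __name__ == "__main__":
    ctx = {"instrument": "meter-A", "sample_prep": "filtered",
           "analysis_method": "direct"}
    a = Measurement(7.2, "pH", dict(ctx))
    b = Measurement(7.4, "pH", dict(ctx))
    bare = Measurement(7.4, "pH", {"instrument": "meter-A"})
    print(comparable(a, b), comparable(a, bare), missing_context(bare))
```

A record such as `bare` still holds a number, but without its preparation and analysis details it cannot safely be set beside other results.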

Establish data provenance
Contextualized data needs to be stored along with an ontology and its provenance: this is what enables it to be compared and used effectively, weighted properly against competing data and quality controlled. Data provenance arises from being able to establish who generated the data, how they did it and a full audit trail of any modification. The more complex the data, the more necessary the ‘key data attributes’ are and the more value can be derived from it. Unfortunately, this vital context and provenance is often lost, ignored or forgotten, dramatically reducing the data’s ability to be compared or used. The trust in the data is reduced or lost entirely and, given the value of today’s data and IP assets, this is wholly unacceptable.
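The three ingredients of provenance named above (who generated the data, how they did it, and an audit trail of every modification) can be sketched as a small append-only log attached to each value. The class and field names below are hypothetical and do not describe any particular product; a real system would also need persistence and access control.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, List


@dataclass(frozen=True)
class AuditEntry:
    who: str
    action: str
    old_value: Any
    new_value: Any
    when: datetime


@dataclass
class ProvenancedValue:
    value: Any
    created_by: str   # who generated the data
    method: str       # how they generated it
    audit_trail: List[AuditEntry] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Record the creation itself as the first audit entry.
        self._log(self.created_by, "created", None, self.value)

    def _log(self, who: str, action: str, old: Any, new: Any) -> None:
        self.audit_trail.append(
            AuditEntry(who, action, old, new, datetime.now(timezone.utc))
        )

    def modify(self, who: str, new_value: Any) -> None:
        """Change the value, logging the old and new values first."""
        self._log(who, "modified", self.value, new_value)
        self.value = new_value


if __name__ == "__main__":
    record = ProvenancedValue(12.5, created_by="alice", method="HPLC assay")
    record.modify("bob", 12.7)
    for entry in record.audit_trail:
        print(entry.who, entry.action, entry.old_value, entry.new_value)
```

Because entries are only ever appended, the history of a value can be replayed to see who changed what, which is the basis for the trust discussed here.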

Conclusions and observations: High value information
Importantly, we must capture conclusions, along with the intelligence of the community as they interpret and challenge shared data. We need to harness effective social tools to enable this interaction and capture the observations and subjective information to further add to the data asset value. Adapting what are now social norms, such as tagging, commenting and easy ‘sharing’, to the R&D data landscape is becoming more prevalent. However, it requires confidence, trust and security within a peer group.

Domain applications enable domain curation
This high-value information, including the quality data and complete contextual information, then needs to be curated, i.e. aggregated, either with local business rules or with domain-informed judgement. This was put well by Dr. James McGurk when he said at IDBS’ 2012 Translational Medicine Symposium in London: “The more difficult it is for others to understand your data, the more likely it will be used badly.” If generating this critical context adds significant burden to the data generator, it will not get done routinely. It needs to be made easy to do. Applying process, without unnecessary constraints, is key, and marrying it to ‘the ergonomics’ of how people work will be the norm, not the exception. The tablet revolution, for example, has driven a change in how we think about capturing and consuming information.
Every domain within R&D has processes. W. Edwards Deming, father of ‘Plan-Do-Check-Act’, said, “If you can’t describe what you are doing as a process, you don’t know what you are doing.” Here is where domain-aware systems add value over simple document stores or office tools.
These processes can be encapsulated within domain applications or modules that provide consistency of data capture, analysis and ontology control, as well as automatically capturing context. They also can ensure that granular security is applied and an audit trail is captured: these also add to the data’s provenance and trust. These modules can be used across regulated and non-regulated environments, disciplines and sectors.
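To make “ontology control” at the point of capture concrete, the sketch below maps free-text technique names onto a small controlled vocabulary and rejects anything outside it, while recording who captured the value automatically. The vocabulary, function names and record layout are hypothetical examples, not a description of how any specific domain application works.

```python
from typing import Dict

# Hypothetical controlled vocabulary: synonym (lower case) -> preferred term.
ONTOLOGY: Dict[str, str] = {
    "hplc": "HPLC",
    "high-performance liquid chromatography": "HPLC",
    "ms": "mass spectrometry",
    "mass spec": "mass spectrometry",
    "mass spectrometry": "mass spectrometry",
}


def normalize_term(raw: str) -> str:
    """Map a free-text term to its preferred form, or raise ValueError."""
    key = raw.strip().lower()
    if key not in ONTOLOGY:
        raise ValueError(f"{raw!r} is not in the controlled vocabulary")
    return ONTOLOGY[key]


def capture(raw_technique: str, value: float, user: str) -> dict:
    """Build a record with a normalized technique and the capturing user."""
    return {
        "technique": normalize_term(raw_technique),
        "value": value,
        "captured_by": user,
    }


if __name__ == "__main__":
    print(capture(" Mass Spec ", 412.3, "alice"))
    try:
        capture("mystery method", 1.0, "bob")
    except ValueError as err:
        print("rejected:", err)
```

Normalizing terms when data is entered, instead of cleaning them up later, is what keeps records from different groups aggregable without adding work for the person generating the data.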

Cross-domain data: an information commons
Only if there are high-context, connected stores can data be effectively aggregated and assimilated to generate a high-quality information landscape. This is the foundation of the big data ecosystem (Figure 1) that can be made available to relevant decision makers, enterprise analytics and the chosen community at large.
Recognized Big Data systems, such as Hadoop, do have leading roles to play in the Big Collaboration story: grid-based analytics is now well established as ‘heavy-lifting gear’ in fields from financial to genomic analysis. But it is inappropriate to think of an R&D world saved by big data storage and algorithms alone. The problem is bigger than that. Competitive advantage will go to those who first find, then fill, the data gaps.

John Reynders, Head of R&D Information at AstraZeneca, reminded the JP Morgan Big Data audience in January 2013, “The future is federated.” This reflects the need to deliver connected R&D against a background of distributed data. Most sane CIOs recognize that creating ‘Death Star’ mega warehouses is both impractical and, when it involves highly regulated data such as patient records, potentially unethical. This view validates the need to make high-value data assets interoperable, comparable and high-quality. Lon Cardon, SVP at GSK, said at the same meeting, “we just need the right approach to noisy datasets.” However, there is great opportunity to learn from other sectors, such as the telecom industry, where effort and investment are focused on reducing the noise and the gaps, not simply accepting them and filtering them out.
If we enable such a foundation of underlying data sets, Metcalfe’s law will then apply. We will have created a high-value data-driven network of well-connected collaborators bringing together an unparalleled community of minds to make decisions using the highest quality, highest availability information.

No more excuses
In today’s cloud-enabled world of extendable bandwidth, which, according to Gilder’s Law, always outstrips computing power, the old limitations of scale no longer apply. Every day, adherents of Zuckerberg’s Law add at least 850 million photos and 8 million videos to the Web alone. Additionally, Gartner recently reported on the availability of global R&D knowledge management systems that can support multidiscipline experimental collaboration. Where such enterprise-class R&D data management systems (such as IDBS’ E-WorkBook) exist, there is no reason why high-quality data capture, contextualization, ontology and security should remain long-term strategies. They need to be put into place to enable Big Collaboration, not in response to it.

Competitive race to the top for quality data
As a cautionary point, getting to the vision of Big Science and Big Collaboration must not be a race to the lowest common denominator of data quality. It must be a race to the highest-context, highest-value distributed datasets to make data re-usable. Many believe that the approaches to enterprise analytics will be ubiquitous, meaning that competitive advantage will derive from two things: the quality of the data the analytics has to work on and how good you are at filling in the data gaps.

The foundational principles of context, provenance, curation and connectivity have enabled telecom communities to create networks with massive value. R&D communities should learn from this success and build the quality systems, datasets and collaboration tools that will enable their Big Data to deliver Big Collaboration for Big Science.

1. 2013 Global R&D Funding Forecast (Battelle, R&D Magazine, December 2012)
4. Benjamin Jones; National Bureau of Economic Research, May 2005, Working Paper No. 11360
6. Scott Weiss, Drug Discovery & Development, May 2011, p.14-15
7. Peter Boogaard, Scientific Computing, October 2012, p.14-16
9. Michael Shanler, Gartner Inc. Manufacturers Must Consider Scientific Domain Expertise During ELN Selection, January 2013

Yike Guo is a professor in the department of computing and also serves as Technical Director of the Parallel Computing Center and Head of the Data Mining Group at Imperial College. He founded InforSense in November 1999 and became CTO in 2009. Guo currently serves as Chief Innovation Officer at IDBS. He may be reached at