For Storing Web 3.0, HBase has the Edge
A comparison of cutting-edge cloud and relational database technologies
A storage system modeled after Google’s BigTable has the edge in data management for next-generation Internet and cloud computing users, claim researchers at the University of Texas – Pan American (UTPA) in Edinburg. In tests designed to find the best storage technologies for Web 3.0, also known as the Semantic Web, Apache’s Hadoop database, HBase, outperformed MySQL Cluster, the UTPA team discovered in a classic confrontation between relational and non-relational databases.
With their own algorithms to adapt the two database systems, the team found that HBase works faster with larger datasets, a major issue since the Semantic Web comprises vast amounts of tags and descriptions known as metadata.
Unlike the current Web 2.0, “the Semantic Web is interconnected metadata that facilitates better information search, discovery and integration,” says study co-author and UTPA assistant computer science professor Artem Chebotko, Ph.D. “Our work, to our best knowledge, is the first to empirically compare HBase and MySQL Cluster for metadata management.”
The results make sense to Retrevo chief scientist Aditya Vailaya, Ph.D., whose Sunnyvale, CA-based firm uses Semantic Web technology to compare thousands of retail electronic devices on price, performance and myriad other metadata factors. Query performance is key with the Semantic Web’s large data stores, Vailaya says. But most databases aren’t equipped with query-improving algorithms, and HBase lends itself to writing them.
“HBase is easier to use than MySQL,” he explains. “Most programmers know how to write code rather than program databases, and not as many people code in SQL.”
Relational vs. non-relational
An open-source, non-relational database written in Java that can scale to thousands of servers, HBase makes many features of Google’s proprietary, high-performance distributed storage system BigTable available to the programming community. It also features a fail-safe library that runs “on top of” a server cluster — a global architecture that detects and handles failures at the local level before they spread.
With similar open-source, scalability and fail-safe features, MySQL Cluster is a relational database whose primary feature is “shared nothing” architecture — interconnected nodes that “share nothing,” including communicable failures. A shared-nothing system won’t fail if one or more nodes fail.
Fail-safe scalability, Vailaya explains, makes MySQL Cluster and HBase good candidates for storing the Semantic Web, online technology’s artificial intelligence-driven future that World Wide Web inventor Tim Berners-Lee hails as “Web 3.0.”
The UTPA study tackled the natural next question: Which candidate would win the next-gen storage challenge?
Making sense of metadata
“In this time of unprecedented information growth, the fastest growing data category is metadata,” says study co-author and UTPA computer science professor Pearl Brazier, Ph.D. “It would be difficult to imagine the success of the Semantic Web without efficient and scalable data management tools to support its large-scale metadata-enabled applications.”
Metadata, explains Yale University pathology and bioinformatics professor Michael Krauthammer, M.D., Ph.D., is background information “the Semantic Web represents in ways that are far superior to current technologies.”
In Krauthammer’s work, metadata known as knowledge provenance, such as the researchers and institutions behind new cancer research, helps assess information credibility and research reproducibility. In a simple genomics example, the data might be a gene sequence. Metadata might include who discovered it, where and how — what laboratory techniques, which funding grants and the like.
“Metadata makes information more useful,” Krauthammer explains.
Busily standardizing the Semantic Web, Berners-Lee’s World Wide Web Consortium (W3C) has recommended two metadata-based languages: the Resource Description Framework (RDF) and Web Ontology Language (OWL).
Though not as widely known as Web 2.0 counterparts, such as HyperText Markup Language (HTML) and Extensible Markup Language (XML), the new languages are used in such high-profile projects as the U.S. census, the Best Buy catalog, Facebook pages, and the latest cancer bioinformatics research.
To first base with HBase
RDF presents data and metadata in so-called “triples,” each made up of a subject, a predicate and an object, in which the predicate defines the relationship between the other two. Linked together, triples form a graph. An example the UTPA team used is a student Craig (subject) who is a member of (predicate) the technology society IEEE (object).
HBase maintains similarly constructed tables, making it “an attractive storage alternative for RDF data,” explains study co-author and UTPA computer science professor John Abraham, Ph.D.
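To make the fit between triples and a row-and-column store concrete, here is a minimal Python sketch. It is not the UTPA team’s schema, which the article does not describe. It assumes one row per subject and one column per predicate, and every name other than Craig and IEEE (the predicates, the second student, the course) is invented for illustration. Plain dictionaries stand in for an HBase table.

```python
from collections import defaultdict

# Each RDF triple is (subject, predicate, object).
# Only the Craig/IEEE triple comes from the article; the rest are made up.
triples = [
    ("Craig", "memberOf", "IEEE"),
    ("Craig", "takesCourse", "Course1"),
    ("Maria", "memberOf", "IEEE"),
]


def build_subject_table(triples):
    """Group triples into rows keyed by subject.

    Hypothetical layout: row key = subject, column = predicate,
    cell = set of objects. Rows only hold the columns they need,
    much like a sparse wide-column table.
    """
    table = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        table[subject][predicate].add(obj)
    return table


subject_table = build_subject_table(triples)

# Look up everything stored about one subject.
for predicate, objects in subject_table["Craig"].items():
    print("Craig", predicate, sorted(objects))
```

A layout like this answers "what do we know about Craig?" with a single row lookup. Questions that start from the object, such as "who belongs to IEEE?", would need a second table keyed the other way around.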
For the next step — retrieving stored RDF data — the UTPA team designed a new algorithm with three functions that allowed their database system — Hadoop 0.20.2 and HBase 0.90 — to evaluate queries in SPARQL, the standard RDF data query language. One UTPA function, matchBGP-DB, translates a SPARQL graph into an HBase table, for instance.
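What it means to evaluate a SPARQL graph pattern can be shown in a few lines of Python. The sketch below is not matchBGP-DB or any other UTPA function. It is a naive in-memory matcher, with hypothetical names (`match_pattern`, `match_bgp`) and sample data, that shows the core idea: each triple pattern either extends a set of variable bindings or rules it out.

```python
def is_var(term):
    """SPARQL variables are written with a leading question mark."""
    return term.startswith("?")


def match_pattern(triple, pattern, binding):
    """Return an extended copy of `binding` if `pattern` matches `triple`,
    or None if a constant or an already-bound variable disagrees."""
    extended = dict(binding)
    for value, term in zip(triple, pattern):
        if is_var(term):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def match_bgp(triples, patterns):
    """Evaluate a basic graph pattern (a list of triple patterns)
    by joining the patterns one at a time."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(triple, pattern, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Illustrative data; only Craig and IEEE appear in the article.
triples = [
    ("Craig", "memberOf", "IEEE"),
    ("Craig", "takesCourse", "Course1"),
    ("Maria", "takesCourse", "Course1"),
]

# "Which IEEE members take which courses?"
query = [("?x", "memberOf", "IEEE"), ("?x", "takesCourse", "?c")]
results = match_bgp(triples, query)
print(results)
```

This version scans every triple for every pattern, which is exactly the kind of cost a real system avoids by pushing the work into the database.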
“By enabling reuse of existing database technologies, our query translation algorithms are efficient, their performance overhead is negligible, and they speed up the development of useful Semantic Web data management tools,” Brazier explains.
On nine commodity machines with 3.0 GHz 64-bit Pentium 4 processors and 2 GB of DDR2-533 random-access memory (RAM) networked via a PowerConnect 2724 gigabit Ethernet switch, the UTPA team tested their algorithm on sample queries from two benchmarks that standardize Semantic Web repository evaluation. The Lehigh University Benchmark (LUBM) offers 14 test queries, while the Third Provenance Challenge (PC3) benchmark offers three queries of increasing complexity based on the kind of research knowledge provenance Yale’s Krauthammer referenced.
The UTPA team’s HBase algorithm ordered datasets from the benchmarks using two criteria. First, it evaluated datasets that yield a smaller result first, decreasing iterations and memory usage. Second, it returned datasets that share variables before datasets with no shared variables, narrowing results more quickly.
In one example, the algorithm reordered a triple dataset that sought professors who taught course Y to students X. The pre-algorithm query first returned all students across all universities nationwide — an enormous dataset — then sifted through courses and finally, professors, slowly matching them up.
With the UTPA algorithm, the query returned a reordered dataset, first professors who taught course Y — a much smaller group — then students who took course Y and, finally, students matched with their professors.
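The two ordering criteria can be sketched as a simple greedy sort. The Python below is an illustration under stated assumptions, not the UTPA implementation: it estimates result size by counting matching triples in memory, prefers patterns that share a variable with those already chosen, and uses made-up predicate names and data for the professor, course and student example.

```python
def is_var(term):
    return term.startswith("?")


def variables(pattern):
    return {term for term in pattern if is_var(term)}


def count_matches(triples, pattern):
    """Rough result-size estimate: triples that agree with the
    pattern's constants (variables match anything)."""
    return sum(
        all(is_var(term) or term == value for term, value in zip(pattern, triple))
        for triple in triples
    )


def reorder(triples, patterns):
    """Greedy ordering: patterns sharing a variable with earlier ones
    come first; ties are broken by the smaller estimated result."""
    remaining = list(patterns)
    ordered, bound = [], set()
    while remaining:
        best = min(
            remaining,
            key=lambda p: (not (variables(p) & bound), count_matches(triples, p)),
        )
        ordered.append(best)
        remaining.remove(best)
        bound |= variables(best)
    return ordered


# Hypothetical data: many students, few professors.
triples = [
    ("s1", "type", "Student"),
    ("s2", "type", "Student"),
    ("s3", "type", "Student"),
    ("s1", "takesCourse", "c1"),
    ("s2", "takesCourse", "c1"),
    ("p1", "teacherOf", "c1"),
]

# Query as written: students first, professors last.
query = [
    ("?x", "type", "Student"),
    ("?x", "takesCourse", "?y"),
    ("?p", "teacherOf", "?y"),
]
plan = reorder(triples, query)
for pattern in plan:
    print(pattern)
```

On this toy data the plan starts with the professor pattern, moves to course enrollment (which shares the course variable) and ends with the student check, mirroring the reordering described above.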
Though the translation algorithm represents extra work, the result was faster queries that consumed less memory based on existing technologies.
“The extra work is inevitable if we want to store Semantic Web data with existing programs. But, solutions are delivered much faster than if we developed a Semantic Web database from scratch,” Brazier explains.
Distributed technologies, such as HBase, that are often used in cloud computing are being explored for distributed and scalable RDF data management.
The spark in SPARQL
A table that stores triples labeled (s, p, o), for subject, predicate and object, is the starting point for the team’s SPARQL-to-SQL translation algorithm, which uses tools on MySQL Cluster 7.1.9a with names such as BGPtoFlatSQL to manipulate the same datasets from the HBase test.
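A rough idea of what such a translation produces can be sketched in Python. The function below, `bgp_to_sql`, is a hypothetical stand-in and not the team’s BGPtoFlatSQL. It assumes a single table named `triples` with columns s, p and o, gives each triple pattern its own alias of that table, and joins the aliases wherever a variable repeats. SQLite is used here only so the example runs anywhere; the study itself used MySQL Cluster.

```python
import sqlite3


def is_var(term):
    return term.startswith("?")


def bgp_to_sql(patterns, table="triples"):
    """Translate a list of triple patterns into one flat SELECT.

    Assumes at least one variable appears in the patterns.
    Returns (sql, params) with constants passed as parameters.
    """
    first_seen = {}  # variable -> column reference where it first occurs
    conditions, params = [], []
    for i, pattern in enumerate(patterns):
        for column, term in zip("spo", pattern):
            ref = f"t{i}.{column}"
            if not is_var(term):
                conditions.append(f"{ref} = ?")
                params.append(term)
            elif term in first_seen:
                conditions.append(f"{ref} = {first_seen[term]}")  # join
            else:
                first_seen[term] = ref
    select = ", ".join(f"{ref} AS {var[1:]}" for var, ref in first_seen.items())
    tables = ", ".join(f"{table} t{i}" for i in range(len(patterns)))
    sql = f"SELECT {select} FROM {tables}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


# Illustrative data; only Craig and IEEE appear in the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Craig", "memberOf", "IEEE"),
        ("Craig", "takesCourse", "Course1"),
        ("Maria", "takesCourse", "Course1"),
    ],
)

query = [("?x", "memberOf", "IEEE"), ("?x", "takesCourse", "?c")]
sql, params = bgp_to_sql(query)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Each extra triple pattern adds another self-join on the same table, which is one reason query shape and data size matter so much for performance in a relational store.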
“MySQL Cluster initially demonstrated a significant advantage over HBase, but that advantage generally decreased with data size growth,” says John Abraham.
Specifically, HBase performed slightly better than MySQL Cluster on queries from the PC3 benchmark. On one LUBM query, MySQL Cluster performed roughly two to three times faster than HBase. But on another, HBase was three to 47 times faster.
Common to both systems on certain queries, the team observed everything from rapid performance degradation to limited or no growth in execution times with an increase in data size.
Finally, on smaller datasets from both benchmarks, MySQL Cluster generally outperformed HBase. But on larger datasets, HBase outperformed MySQL Cluster.
Initially, MySQL Cluster demonstrated a significant advantage over HBase. However, this performance advantage decreased with growth in dataset size.
Winner by a nose
The results did not reveal a hands-down winner. But because faster queries over larger datasets were the object of the game, “HBase showed superior performance and scalability,” Pearl Brazier explains.
The results weren’t surprising, but also weren’t a given.
“As an open-source implementation of Google’s Bigtable, HBase has proven capable of managing very large volumes of data,” Brazier says. But that’s on Web 2.0, so “it was questionable if HBase was up to efficiently handling complex Semantic Web queries. Our study showed that it was.”
The team presented their findings this July at the 2011 IEEE International Cloud Computing Conference in Washington, DC.
“Keeping in mind that HBase is far less mature than MySQL Cluster,” Chebotko says, “our research team has high hopes for it.”
Mike Martin is a science and technology writer based in Columbia, MO. He may be reached at editor@ScientificComputing.com