Hadapt: Yale Startup

By Sherwin Yu

March 14, 2012

Daniel Abadi, Assistant Professor of Computer Science, has taken one of the biggest gambles of any academic career: he has started a company, Hadapt, before receiving tenure. The goal of Hadapt is to commercialize Abadi’s research on distributed database systems by building an adaptive data analysis platform that combines the strengths of two existing technologies: relational databases and MapReduce. To fully understand the concepts behind Hadapt, it is necessary to understand recent developments in database research and industry.

Relational Databases

Traditional approaches to data management include relational databases. These databases organize data in a structured manner based on relations found in the data. For example, a store might keep information about customers, current products, and customer purchase history. These are all components related to one another: multiple customers can purchase a product, and each customer has a single purchase history. These individual components also have information attached to them, such as a customer’s name or a product’s price. Relational databases allow researchers to form complex queries, such as “What are the names of all customers who purchased at least two products costing more than $10.00?”
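
As a rough illustration, the store question above can be written as a SQL query. The sketch below uses Python's built-in sqlite3 module. The table names (customers, products, purchases), their columns, and the sample rows are assumptions invented for this example, and "at least two products" is read here as two distinct products.

```python
import sqlite3


def build_demo_store():
    """Create a small in-memory store database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT, price REAL);
        CREATE TABLE purchases (customer_id INTEGER REFERENCES customers(id),
                                product_id  INTEGER REFERENCES products(id));
        INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
        INSERT INTO products  VALUES (1, 'Lamp', 25.00), (2, 'Pen', 1.50),
                                     (3, 'Chair', 40.00);
        INSERT INTO purchases VALUES (1, 1), (1, 3), (2, 1), (2, 2), (3, 2);
    """)
    return conn


def customers_with_two_pricey_purchases(conn, min_price=10.00):
    """Names of customers who bought at least two distinct products
    costing more than min_price."""
    query = """
        SELECT c.name
        FROM customers AS c
        JOIN purchases AS pu ON pu.customer_id = c.id
        JOIN products  AS pr ON pr.id = pu.product_id
        WHERE pr.price > ?
        GROUP BY c.id, c.name
        HAVING COUNT(DISTINCT pr.id) >= 2
        ORDER BY c.name
    """
    return [row[0] for row in conn.execute(query, (min_price,))]


if __name__ == "__main__":
    store = build_demo_store()
    print(customers_with_two_pricey_purchases(store))
```

In this sample data only Alice bought two products priced above $10.00, so hers is the only name returned. The join across three tables is what makes the relational model convenient for questions like this one.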

Because relational databases are effective at managing this sort of structured information, they have become the dominant model for storing information for many businesses and industries. Many business intelligence tools interface with relational databases through a common language called SQL.

Today, though, many institutions, academic and commercial alike, generate massive amounts of unstructured data. Wal-Mart tracks supply and demand transactions, eBay tracks clicks, and Verizon tracks calls. Likewise, genomics, particle physics, and meteorology are all fields that deal with enormous volumes of data. “So much data,” Abadi explains, “that you couldn’t possibly hope to fit it on a single computer.”

The solution is to use more than one computer, both for storing data and for doing computations.

Figure: A surge in the availability of unstructured data has occurred in recent years. Traditional relational databases are not equipped to handle such large data sets. Adapted with permission from Daniel Abadi.

MapReduce and HadoopDB

In 2004, Google introduced MapReduce, a framework for distributed computing on large distributed data sets. Hadoop is an open-source implementation of MapReduce, supported by many companies in the industry. MapReduce and Hadoop are conceptually identical: both split data processing into two phases, map and reduce. In the map phase, many computers perform the same task but on different parts of the input data, all in parallel. Data emitted from the map phase then proceed to reducers, which combine the intermediate data, again in parallel, to yield the final result. The input data are typically stored in a distributed file system, on multiple computers, and MapReduce programs are deployed on the same cluster of computers.
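
The two phases can be sketched in a few lines of Python. This is a toy word count on one machine, not Hadoop's actual API: threads stand in for the computers of a cluster, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration. The grouping step between the phases, often called the shuffle, is assumed here as the way intermediate data reach the reducers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit one (word, 1) pair for every word in this chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group the intermediate pairs by key so each reducer sees one key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(item):
    """Reduce: combine all values emitted for one key into a single count."""
    key, values = item
    return key, sum(values)


def word_count(chunks, workers=4):
    """Run the map tasks, then the reduce tasks, on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))       # same task, different data
        groups = shuffle(mapped)
        reduced = list(pool.map(reduce_phase, groups.items()))
    return dict(reduced)


if __name__ == "__main__":
    # Each inner list plays the role of the data stored on one machine.
    print(word_count([["the cat sat"], ["the dog sat", "the end"]]))
```

Because every map task touches only its own chunk and every reduce task only its own key, adding machines mostly means adding chunks, which is the property behind the scalability discussed next.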

While it is possible to use multiple computers to build parallel databases, Hadoop’s primary advantage is its scalability. Parallel databases scale up to hundreds of computers, but Hadoop can surpass even that, capable of efficiently utilizing the resources of a thousand or more machines.

However, traditional databases still have several advantages over Hadoop. First, Hadoop was not designed for structured data analysis, so on structured-data tasks relational databases can still outperform Hadoop. Second, many business intelligence tools are built on top of SQL and relational databases. Hadoop does not support SQL and therefore is incompatible with many existing analysis tools.

So Abadi and his graduate students began working on HadoopDB, which eventually became Hadapt. Hadapt uses a hybrid of Hadoop and relational databases. Each node in a Hadapt cluster runs its own relational database (for structured data) but also contributes to the distributed file system (for unstructured data). This combines the scalability of Hadoop with the structured data performance of relational databases.
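
The hybrid idea can be pictured with a toy model. It is not Hadapt's or HadoopDB's actual code: in-memory SQLite databases stand in for whatever database a real node would run, and the sales table and the function names are assumptions made for this sketch. Each node answers a SQL query over its own share of the data, and a coordinator combines the partial answers in a reduce-style step. The distributed file system side, for unstructured data, is left out.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """One cluster node: a private relational database holding part of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def local_totals(node):
    """Map-like step: the node's own database does the structured work in SQL."""
    return node.execute(
        "SELECT product, SUM(amount) FROM sales GROUP BY product"
    ).fetchall()


def cluster_totals(nodes):
    """Reduce-like step: merge the per-node subtotals into one final answer."""
    totals = defaultdict(float)
    for node in nodes:  # a real cluster would query the nodes in parallel
        for product, subtotal in local_totals(node):
            totals[product] += subtotal
    return dict(totals)


if __name__ == "__main__":
    cluster = [
        make_node([("lamp", 25.0), ("pen", 1.5)]),
        make_node([("lamp", 25.0), ("chair", 40.0)]),
    ]
    print(cluster_totals(cluster))
```

The point of the split is that the heavy structured work (filtering, grouping, summing) happens inside each node's database, while the coordinator only merges small partial results.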

Figure: In the map phase, the input data are split among many map node computers. The data are processed in parallel and then, in the reduce phase, combined to yield the output. Courtesy of the author.

From Research to Start-up

Though HadoopDB was an academic research prototype, Abadi was hoping that the open-source community would pick it up. Before Abadi published his paper, he blogged informally about it in the summer of 2009, and HadoopDB generated a lot of interest.

His blog post received over 5,000 views, and Abadi started getting calls from venture capital firms asking to commercialize the research. At first, he simply said no, as he had no intention of commercializing. “Generally as a junior faculty member you don’t start a company. It’s just not something you do,” Abadi explained. Instead, the expectation is that junior faculty members concentrate their efforts on activities that help in the tenure evaluation, such as publishing papers.

Yet Abadi wanted his research to be implemented and put to use, to have real-world impact. But there are significant barriers to industry adoption of academic research. “It’s almost impossible to get other people to actually adopt your design in real-world deployments unless an extensive and complete prototype is available or your design is already proven in real-world applications.”

Abadi explains that a researcher who wants his or her system design research to make an impact has a few options. The first is to build a full prototype, ready for real use, with lab resources, but limited funding and the considerable amount of work required make this difficult.

An alternative to developing a prototype in the lab is to leave academia to join a company that has resources devoted to research, such as Google, Yahoo, or Facebook. “If I didn’t love other aspects of being part of an academic community so much, this is certainly what I would do,” explained Abadi.

The last option for researchers is to develop a start-up to build the prototype after raising money from angel investors and venture capital funds. Start-ups face significant challenges as well. There is a high risk of failure, and the project requires an extraordinary amount of time and effort. Conducting market research, meeting with investors and customers, and dealing with patents are just a few of the time-intensive tasks that the researcher would have to handle. Though Abadi has always had a bit of entrepreneurial spirit and fully intended to start a company at some point, he believed that the time and energy required were too much for a pre-tenure professor.

So when Justin Borgman, the current CEO of Hadapt, approached Abadi about commercializing, the young professor said no, just as he had before. But Borgman persisted and, after many months of discussion, Abadi began seriously considering commercializing his research. In early 2010, Borgman and Abadi sat down to work; by the end of 2010, they had worked out licenses with the Yale Office of Cooperative Research and had raised a seed round of capital for an undisclosed amount, enough to last a year and a half.

Borgman and Abadi’s intention was to raise another round of funding in June 2012 once they had made more progress, but one of the investors made a pre-emptive offer. Hadapt raised $9.5 million this fall, and Abadi says they are looking to grow quickly.

Abadi’s previous experiences with academic start-ups prepared him for developing Hadapt. While he was a Ph.D. student at MIT, his advisor started three companies, one of which was very successful. Abadi saw his Ph.D. research commercialized by Vertica, which was recently sold to HP. Early on, he contributed to the non-technical side of those companies as well, sitting in on business strategy meetings. Abadi also had the opportunity to take business classes as a graduate student at MIT. Overall, though, Abadi says, “classes are great, but nothing beats just doing it. Working on Vertica taught me a lot.”

Though Borgman is primarily in charge of the executive duties, such as contacting investors, Abadi plays a critical role as an expert on the database systems industry. “Going from technology to a business application is a big jump. Do they want a database system, or do they want a data warehouse? Do they want a tool for data mining? Do they want a vertical specific solution?” Abadi’s deep understanding of the industry guided their answers to these questions. Keeping up to date on the industry, Abadi explains, is essential in an applied field such as databases. Many companies have very challenging problems that are related to research problems. “If you’re a researcher, you want to work on interesting problems, which Twitter, Facebook, and Google actually have. Each company might have very specific problems, but the general problem is a research project.”

Today, Abadi is still a full-time professor at Yale and serves as the Chief Scientist at Hadapt. He is concentrating on his research but points out that his research mirrors his work at Hadapt. With Hadapt, he has the added benefit of seeing his work have real-world impact. “It’s the greatest gamble of my life, but I’m having a blast and it’s a lot of fun.”

About the Author
SHERWIN YU is a senior in Morse College studying Computer Science and Molecular Biophysics and Biochemistry.

Acknowledgements
The author would like to thank Professor Abadi for his help in writing this article.