A conversation with the ‘father of Hadoop’
By Benjamin Cher December 22, 2015
- Hadoop has evolved into a key component of businesses today
- The ecosystem lacks skilled talent, the technology remains complex
It is not often that someone sees his or her creation take flight and change the world, but the ‘father of Hadoop’ Doug Cutting happens to fall into this unique category.
The Cloudera chief architect and founder created Hadoop with Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son’s toy elephant, according to an article in The New York Times.
Hadoop was envisioned as an open source framework for the distributed storage and processing of large data sets, but its impact has grown beyond just that.
“It has progressed beyond what I imagined – I had hoped to build what would be a useful tool which would be mostly used by web companies in Silicon Valley,” says Cutting.
“I didn’t think it would necessarily be the heart of an ecosystem that is being used by nearly every industry around the world – it is not something I anticipated,” he adds, speaking to Digital News Asia (DNA) in Singapore recently.
Bringing the benefit of Hadoop to all these industries has been exciting, Cutting says, adding that “it’s a better basis for what they are doing, helping them understand their data better.”
With so many applications for Hadoop, Cutting struggles to list a few examples that stand out for him, saying it’s hard to pick favourites. But he does have his ‘favourite use cases.’
“My father found his wife via an online dating service which used Hadoop to make matches between people, which was fun.
“I didn’t even suggest that he use this – he went to it on his own,” he says.
The Internet of Things (IoT) use case interests Cutting the most: from tractors and heavy equipment to airliners, non-IT industries are taking advantage of the technology in a big way to drive growth and profits.
Hadoop life, relational death?
Hadoop is becoming the new default platform on which people build applications that store and use data.
“It’s taken over from the relational database management system – people are now first building systems on the Hadoop platform,” Cutting declares.
“While this is still in the minority, in a few years it will become the majority – data systems will be based on this ecosystem, and it’s pretty exciting,” he adds.
One of the key differentiators Hadoop has going is that it changes much more quickly than prior generations of database frameworks, according to Cutting.
“A few companies really controlled that business; however in this [the Hadoop] ecosystem, there are a few vendors but they don’t have the same degree of control.
“New technologies are invented in a variety of different places – we’ve seen it with Spark coming out of the University of California at Berkeley, and Kafka from LinkedIn,” he says.
The community – as with most open source technologies – plays a big role in getting vendors to embrace these new technologies, as the vendors will follow the users to them.
“The ecosystem is only 10 years old and is already seeing this kind of evolution in it, which has more fundamental change than we ever saw in the relational world,” Cutting says.
“We will see more of this in the coming decade – rapid improvement and addition of new technologies much more than ever before.
“Change is the new normal,” he adds.
Hadoop, we have a problem
Still, Cutting concedes there are currently a few issues standing in the way of making that vision a reality, one of them being the problem of instant processing.
“We have seen a lot more tools that support online analysis – originally it was only for offline batch processing,” he adds.
These tools include HBase, which lets people build online storage systems with real-time key-value lookup; Spark Streaming; and Impala.
While real-time is where people are heading, Cutting believes that there is still a place for batch processing.
“There are some things that need to be done as a batch, a lot of model building is hard if done incrementally. You can use the model once it is built to batch processes in real-time.
“We are going to see a lot more of this combination of batch-processing and real-time processing,” he ventures.
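The combination Cutting describes can be illustrated with a minimal sketch. It is not Hadoop code and does not come from the interview: the function names, the sensor readings and the threshold are all hypothetical, and a simple mean-and-deviation model stands in for whatever model a real deployment would build. The point is only the split between a batch step that scans the full history and a real-time step that scores one reading at a time.

```python
from statistics import mean, pstdev


def build_model(history):
    """Batch step: scan the full historical data set once to fit the model."""
    return {"mean": mean(history), "std": pstdev(history)}


def is_anomaly(model, reading, threshold=3.0):
    """Real-time step: score one incoming reading against the prebuilt model."""
    if model["std"] == 0:
        # Degenerate history: anything different from the mean is unusual.
        return reading != model["mean"]
    z_score = abs(reading - model["mean"]) / model["std"]
    return z_score > threshold


# Hypothetical historical sensor readings, processed offline as a batch.
history = [70.1, 69.8, 70.4, 70.0, 69.9, 70.2, 70.3, 69.7]
model = build_model(history)

# Hypothetical live readings, scored one by one as they arrive.
stream = [70.0, 70.5, 85.0]
flags = [is_anomaly(model, reading) for reading in stream]
print(flags)
```

In this sketch the expensive pass over the history happens once, while each live reading needs only a constant-time comparison against the stored model, which is one reason the two styles of processing tend to be paired.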
The complexity of Hadoop is another challenge, one that does not appear to be going away anytime soon.
“We need some layers to make it simpler to use – that’s an area people need to pay more attention to,” Cutting says.
“That said, it has a real advantage over prior generations [of database frameworks] in that there are a range of different options to do analysis or search,” he adds.
There is no single tool or solution to solve every problem, Cutting argues. “Not everything is just an SQL query – we have full-text search, machine learning and a range of different tools.
“This makes it hard to figure out what is best, so we need better documentation, more training as well as higher level tools to advise people what is best for them,” he adds.
Then there is the fact that Hadoop faces a talent crunch. “The biggest lack of skill is in the knowledge of these software systems,” Cutting says.
“I believe that there are sufficient people who have the industry and mathematical knowledge, who can then learn the necessary technical skills,” he adds.
Much has been done to educate both students and professionals on Hadoop, according to Cutting, and while demand is high, it is still being met reasonably well.
“People can learn on the job – there are people out there with the necessary industry and mathematical knowledge, and all we need to add is the specifics about these new tools,” he says.
Smart cities and privacy
Hadoop use has been boosted by the smart cities push by governments across the world, according to Cutting.
“There’s a lot of opportunities to address the problems governments are having – managing healthcare, transportation, energy use and so on, all of which are amenable to data,” he says.
However, along with the use of such data comes the fear of privacy invasion, an issue that Cutting feels the industry should spend more time talking about.
“I think privacy is a complicated issue, and we need to spend more time talking about it in the industry, building the trust and confidence of users.
“We need to tell them what data we are gathering, what we are using it for, who we are giving it to and under what circumstances, so people understand these as they are using these systems – rather than discovering later what has been done,” he says.
If this is done well, people would have a choice of using the technology or not, and will not be angry later at the choices they made, Cutting believes.
“If we don’t do this and people do get angry, we are going to see a lot of regulation that could inadvertently prevent a lot of good applications,” he cautions.
Citing healthcare as an example, Cutting notes that researchers require data from a large population to make significant advances.
This would not be possible if there were regulations that clamped down on it, he argues.
“If we don’t permit people to collect that data, we won’t see the value – it’s the same with education, energy and all those other use cases.
“It is imperative we collect personal data to get the most value but … there is opportunity for abuse, and I think we can manage this,” he says.
“Things are moving so quickly that we haven’t had time to build those up, and it’s time we started doing that but doing so carefully,” he adds.
These are exciting times for Hadoop, with every industry becoming a technology industry, according to Cutting.
“Our economy is changing; more and more jobs are going to knowledge workers – even if they are in industries we might not expect to be knowledge-based.
“From transportation to agriculture, there is more and more we can automate … all these things that take the intelligence of people.
“We are going to see a big transition in society as these technologies pervade industries,” he adds.