Increase business agility, tap into open source big data projects
By Jason Bissell and Calvin Hoon June 19, 2018
- Open source is already the default option across several big data categories
- The open source culture is open-minded, innovative and collaborative
TWENTY years ago, the Open Source Definition was published, setting in motion what would become the most significant trend in software development since that time.
Whether you want to call it "free software" or "open source", ultimately, it’s all about making application and system source code widely available and putting the software under a license that favours user autonomy.
According to Ovum, open source is already the default option across several big data categories ranging from storage, analytics and applications to machine learning.
In the latest survey by Black Duck Software and North Bridge, 90% of respondents reported that they rely on open source “for improved efficiency, innovation and interoperability,” most commonly because of “freedom from vendor lock-in; competitive features and technical capabilities; ability to customise; and overall quality.”
There are now thousands of successful open source projects that companies must strategically choose from to stay competitive.
While every company must develop its strategy and choose the open source projects it feels will fuel its desired business outcomes, there are some projects that we feel are worth strong consideration.
How open source can be your path to business agility
The following are a few of the big data open source projects with the greatest potential to give companies extreme agility and lightning-fast responses to customers, business needs and market challenges.
Apache Beam is a unified programming model whose name combines "batch" and "stream", the two modes of big data processing, because it is a single model for both. Under the Beam model, you only need to design a data pipeline once, and can choose from multiple processing frameworks later.
Your data pipeline is portable and flexible, so you can run it as either batch or streaming. This way, your team gains much greater agility and flexibility to reuse data pipelines and to choose the right processing engine for each use case.
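To make the "design once, choose the engine later" idea concrete, here is a minimal sketch in plain Python. It is not the Apache Beam API: the names `build_pipeline`, `run_batch` and `run_streaming` are hypothetical and stand in for a pipeline definition and two interchangeable processing engines.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical names throughout: this is NOT the Apache Beam API,
# only a toy illustration of "define once, pick the engine later".
Transform = Callable[[Iterable[str]], Iterable[str]]


def build_pipeline() -> List[Transform]:
    """Define the pipeline once as an ordered list of lazy transforms."""
    return [
        lambda rows: (r.strip().lower() for r in rows),  # normalise text
        lambda rows: (r for r in rows if r),             # drop empty rows
    ]


def apply_all(steps: List[Transform], rows: Iterable[str]) -> Iterable[str]:
    """Chain every transform over the incoming rows."""
    for step in steps:
        rows = step(rows)
    return rows


def run_batch(steps: List[Transform], rows: List[str]) -> List[str]:
    """Batch 'engine': process a bounded data set and return all results."""
    return list(apply_all(steps, rows))


def run_streaming(steps: List[Transform], source: Iterator[str]) -> Iterator[str]:
    """Streaming 'engine': emit each result as soon as its input arrives."""
    for row in source:
        yield from apply_all(steps, [row])
```

The same `build_pipeline()` result can be handed to either function unchanged, which is the portability the Beam model aims for on real engines.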
Apache Airflow is ideal for automated, smart scheduling of Beam pipelines to optimise processes and organise projects.
Among other beneficial capabilities and features, pipelines are configured as code, which makes them dynamic, and metrics are visualised graphically for DAG (directed acyclic graph) and task instances. If and when there is a failure, Airflow has the ability to rerun a DAG instance.
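The idea of a pipeline expressed as code, with failed steps run again, can be sketched with the Python standard library alone. This is a toy under stated assumptions, not Airflow's API: `run_dag`, `tasks`, `upstream` and `max_attempts` are hypothetical names chosen for illustration.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_dag(
    tasks: Dict[str, Callable[[], None]],
    upstream: Dict[str, Set[str]],
    max_attempts: int = 2,
) -> List[str]:
    """Run tasks in dependency order, rerunning a failed task if attempts remain.

    `upstream` maps each task name to the set of tasks it depends on.
    Hypothetical helper for illustration only; not part of Airflow.
    """
    log: List[str] = []
    for name in TopologicalSorter(upstream).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
            except Exception:
                if attempt == max_attempts:
                    raise  # out of attempts: surface the failure
            else:
                log.append(f"{name}: ok (attempt {attempt})")
                break
    return log
```

Because the graph is ordinary data in code, it can be generated or changed programmatically, which is what makes code-configured pipelines dynamic.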
Apache Cassandra is a scalable and nimble multi-master database that enables failed node replacements without having to shut anything down, and automatic data replication across multiple nodes.
It’s a NoSQL database with high availability and scalability. It differs from the traditional RDBMS, and from some other NoSQL databases, in that it has no master-slave structure: all nodes are peers, and the cluster is fault tolerant.
This makes it extremely easy to scale out for more computing power without any application downtime.
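A rough way to picture a masterless cluster is a ring of peer nodes, where each key is stored on the next few nodes around the ring. The sketch below illustrates that general idea only; it is not Cassandra's implementation, the `Ring` class is hypothetical, and MD5 is merely a convenient stand-in hash.

```python
import hashlib
from bisect import bisect_right
from typing import List


class Ring:
    """Toy peer-to-peer ring: every node is equal, and each key lives on
    `replication_factor` consecutive nodes. Hypothetical, for illustration."""

    def __init__(self, nodes: List[str], replication_factor: int = 2):
        self.rf = replication_factor
        # Place each node on the ring at the position given by its hash.
        self.tokens = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(value: str) -> int:
        # MD5 is an assumption made for determinism, not a claim about Cassandra.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def replicas(self, key: str) -> List[str]:
        """Return the nodes that hold `key`, walking clockwise from its hash."""
        start = bisect_right(self.tokens, (self._token(key), ""))
        count = min(self.rf, len(self.tokens))
        size = len(self.tokens)
        return [self.tokens[(start + i) % size][1] for i in range(count)]
```

In this toy, if the first owner of a key drops out, the second owner becomes the first, so a copy of the data stays reachable without any coordinator node.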
Apache CarbonData is an indexed columnar data format for incredibly fast analytics on big data platforms such as Hadoop and Spark.
This new kind of file format solves the problem of serving analytical queries for different use cases. With Apache CarbonData, the data format is unified, so you can serve those queries from a single copy of the data and use only the computing power needed, making your queries run much faster.
Apache Spark is one of the most widely utilised Apache projects and a popular choice for incredibly fast big data processing (cluster computing) with built-in capabilities for real-time data streaming, SQL, machine learning, and graph processing.
Spark is optimised to run in memory and enables interactive streaming analytics, so you can analyse vast amounts of historical data alongside live data to make real-time decisions in areas such as fraud detection, predictive analytics, sentiment analysis and next-best offer.
TensorFlow is an extremely popular open source library for machine intelligence which enables far more advanced analytics at scale.
TensorFlow is designed for large-scale distributed training and inference, but it is also flexible enough to support experimentation with new machine learning models and system-level optimisations.
It is very readable and well documented, and its community is expected to keep growing and become even more vibrant.
Docker and Kubernetes are container and automated container management technologies that speed deployments of applications.
Using technologies like containers makes your architecture extremely flexible and more portable. Your DevOps process will benefit from increased efficiencies in continuous deployment.
As impressive as each of these open source projects is individually, it is the collective advances that best illustrate the huge impact the open source community has had on the enterprise, and the monumental shift from legacy and proprietary software to open source-based systems. That shift enables companies of all sizes, across all industries, to increase speed, agility and data-driven insights at all levels of their organisations.
How companies can prepare for the OSS changes ahead
While the changes that have already occurred are quite breathtaking, this is not the end of the story for these and other market-shaping forces.
There are several ways for companies to capitalise on the sea change that has already occurred and to adapt to the innovations yet to come from the mashup of open source, cloud and big data.
Become an open source champion in your business
Join the open source communities relevant to your projects and interests. Educate yourself, your team and management on the benefits of open source. Determine what you can leverage instead of “reinventing the wheel”.
Contribute to open source projects
Many companies use open source today, but unfortunately many of them do not contribute. When you contribute upstream to a project, others benefit from your work, and your company benefits from theirs. It means more feedback, more new features and more issues fixed.
Become an influencer in open source projects key to your company
By contributing to the open source community, your company builds influence over the projects that matter most to its progress. That influence helps you steer changes to those projects in directions that particularly benefit your company.
Change the business culture to open source
The open source culture is open-minded, innovative and collaborative. Embracing transparency helps the team accept feedback with grace, stay open-minded and be receptive to change.
Change has always been the only constant in human existence and business. But change is happening faster now than at any other time in history.
By staying open-minded, attuned to open source, and aware of the many ways to use data and analytics, you’ll be well prepared for whatever pops up next on the horizon.
Jason Bissell is the general manager of Asia Pacific and Japan and Calvin Hoon is the regional VP of Sales, Asia Pacific at Talend Inc.