Top 7 things to do to succeed in a Big Data project
- June 11, 2020
- Posted by: ameya_admin
- Category: Big Data
1. Ironically, don’t call it Big Data!
Big Data refers to data that, because of its four Vs (Volume, Variety, Veracity and Velocity), is deemed unfit for handling with conventional database technologies. From a program / project viewpoint, however, what we are usually working with is a Data Lake: a central repository that stores all structured and unstructured data at any scale and on which analytics of various types can be run. In short, any Big Data project starts from the basic understanding that the thing you actually process is a Data Lake.
2. It should be an Enterprise-level program / project and be treated as one.
Big Data projects should always be approached as an Enterprise-level program / project. Why? As the previous point illustrates, we are working with a Data Lake that is enterprise-wide, so a top-down approach is needed. This also keeps everything aligned with the corporate data strategy. Individual departments’ Big Data projects should all sit under this umbrella as sub-projects. Being an enterprise program / project also means that it needs the involvement of enterprise-level architects and of the governance, auditing and security teams.
3. Most important of all is to employ the right resources.
While many factors determine the success of a Big Data project, the most important is having the right resources, as this can make or break your project. Make sure that people with the required knowledge and know-how are employed. Closely related is deploying the tools best suited to the assignment at hand. Using unsuitable tools, like cutting a tree with a pair of scissors or a plant with an axe, will invariably do more harm than good, not to mention the valuable money and time lost on the endeavor.
4. The Data Lake Platform should be generic.
The Data Lake platform should be generic at all layers, with the ability to plug in other software. The platform code should not be locked to a vendor and should be adaptable and flexible enough to integrate with any other software. The platform should support widely used standards and protocols such as JDBC, ODBC, REST APIs, FTP and SSH. Also, now that cloud storage and computing have become the order of the day, it is very important that the platform supports cloud integration straight out of the box.
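One way to picture a "generic" layer is a small plug-in contract that every source connector implements. The sketch below is only an illustration: the `Connector` class, the `register` decorator and the `"memory"` protocol are hypothetical names, and the in-memory connector stands in for a real JDBC, REST or SFTP connector so that the example runs without any external system.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Type


class Connector(ABC):
    """Hypothetical plug-in contract: every source type exposes the same method."""

    @abstractmethod
    def read(self, location: str) -> Iterable[dict]:
        """Return the records found at the given location."""


# Registry mapping a protocol name to the connector class that handles it.
_REGISTRY: Dict[str, Type[Connector]] = {}


def register(protocol: str):
    """Class decorator that plugs a connector into the registry."""
    def decorator(cls: Type[Connector]) -> Type[Connector]:
        _REGISTRY[protocol] = cls
        return cls
    return decorator


@register("memory")
class InMemoryConnector(Connector):
    """Stand-in for a real connector so the sketch runs offline."""

    def read(self, location: str) -> Iterable[dict]:
        return [{"source": location, "value": 1}]


def get_connector(protocol: str) -> Connector:
    """Look up a connector by protocol name; fail loudly if none is plugged in."""
    try:
        return _REGISTRY[protocol]()
    except KeyError:
        raise ValueError(f"no connector registered for {protocol!r}") from None
```

With this shape, supporting a new source means registering one more class; the code that calls `get_connector(...)` does not change, which is the property the platform should have at every layer.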
5. Clearly define the different zones / layers of the Data Lake at the start.
During the architecture phase / initial set-up of the Data Lake, a critical step is to define the different zones / layers for ease of maintenance and operability. Not doing this will leave the Data Lake looking more like a Data Swamp! All zones across the lake should follow consistent naming conventions, and each should have its own clearly defined security protocols, data governance structures and lifecycles. It is like city planning: a plan is considered proper only when it has clearly marked zones for educational institutions, commercial centers, recreational areas and so on.
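To make this concrete, the zone definitions can be written down once and checked automatically. This is a minimal sketch under stated assumptions: the `*_zone` naming rule, the `curated_zone` middle layer, the role names and the retention figures are all invented for illustration and are not prescribed anywhere in this post.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumed naming convention: lowercase letters followed by "_zone".
ZONE_NAME_PATTERN = re.compile(r"^[a-z]+_zone$")


@dataclass(frozen=True)
class Zone:
    name: str
    access_roles: Tuple[str, ...]   # security: who may read the zone
    retention_days: int             # lifecycle: how long data is kept
    data_steward: str               # governance: who is accountable


# Illustrative zones; the names, roles and numbers are hypothetical.
ZONES = [
    Zone("ingestion_zone", ("data_engineer",), 30, "platform_team"),
    Zone("curated_zone", ("data_engineer", "analyst"), 365, "governance_team"),
    Zone("consumption_zone", ("analyst", "bi_user"), 730, "governance_team"),
]


def validate_zones(zones: Sequence[Zone]) -> List[str]:
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    seen = set()
    for zone in zones:
        if not ZONE_NAME_PATTERN.match(zone.name):
            problems.append(f"{zone.name}: name breaks the naming convention")
        if zone.name in seen:
            problems.append(f"{zone.name}: defined more than once")
        seen.add(zone.name)
        if not zone.access_roles:
            problems.append(f"{zone.name}: no access roles defined")
        if zone.retention_days <= 0:
            problems.append(f"{zone.name}: retention must be positive")
    return problems
```

Running `validate_zones(ZONES)` as part of the platform set-up is one cheap way to stop the lake from drifting towards a swamp as new zones are added.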
6. Tools that you deploy should be configuration based and not code based.
Anything that is code based tends to be time consuming and expensive. Hence, whenever and wherever feasible, go for configuration-based tools that are codeless and best suited to the Data Lake at hand. The many advantages include faster development lifecycles, less maintenance and greater ease of switching to other platforms.
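The difference is easiest to see in a toy example. In the sketch below a pipeline is described purely as data (`PIPELINE_CONFIG`), and a small generic engine interprets it. The step types, column names and sample rows are hypothetical; a real configuration-based tool would offer a far richer set of steps, but the idea is the same.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def _filter_equals(rows: List[Row], params: dict) -> List[Row]:
    """Keep only rows where the given column equals the given value."""
    return [r for r in rows if r.get(params["column"]) == params["value"]]


def _rename(rows: List[Row], params: dict) -> List[Row]:
    """Rename columns according to the mapping; other columns are untouched."""
    mapping = params["mapping"]
    return [{mapping.get(k, k): v for k, v in r.items()} for r in rows]


def _drop(rows: List[Row], params: dict) -> List[Row]:
    """Remove the listed columns from every row."""
    columns = set(params["columns"])
    return [{k: v for k, v in r.items() if k not in columns} for r in rows]


# The engine knows a fixed set of step types; pipelines only reference them.
STEP_TYPES: Dict[str, Callable[[List[Row], dict], List[Row]]] = {
    "filter_equals": _filter_equals,
    "rename": _rename,
    "drop": _drop,
}

# The pipeline itself is configuration, not code (values are illustrative).
PIPELINE_CONFIG = [
    {"type": "filter_equals", "column": "country", "value": "IN"},
    {"type": "rename", "mapping": {"cust_id": "customer_id"}},
    {"type": "drop", "columns": ["internal_note"]},
]


def run_pipeline(rows: List[Row], config: List[dict]) -> List[Row]:
    """Apply each configured step in order and return the resulting rows."""
    for step in config:
        rows = STEP_TYPES[step["type"]](rows, step)
    return rows
```

Changing the pipeline now means editing a few lines of configuration rather than writing and testing new code, which is where the shorter development cycle and the lower maintenance come from.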
7. Treat each project within a Data Lake as a separate entity.
The rationale for this is quite simple. Each project in a Data Lake comes with its own unique set of requirements, so it’s only logical to treat each as a separate entity. The requirements of each project should start at the ingestion zone and end at the consumption zone. Data sharing with other teams / departments should happen within the ingestion zone (the only place where it can happen) and should be flexible, subject of course to the sensitivity of the information.
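The sharing rule described above can be sketched as a simple check. Everything here is an assumption made for illustration: the three sensitivity levels, the idea of a per-project "clearance" and the dataset and project names are invented, and a real lake would enforce this through its own security layer.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sensitivity scale; a higher value means more sensitive."""
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class Dataset:
    name: str
    owner_project: str
    zone: str                 # e.g. "ingestion" or "consumption"
    sensitivity: Sensitivity


def can_share(dataset: Dataset, requesting_project: str,
              clearance: Sensitivity) -> bool:
    """Decide whether another project may read the dataset."""
    # A project always has access to its own data, in any zone.
    if requesting_project == dataset.owner_project:
        return True
    # Cross-project sharing happens only within the ingestion zone.
    if dataset.zone != "ingestion":
        return False
    # Within the ingestion zone, sharing depends on information sensitivity.
    return clearance >= dataset.sensitivity
```

For example, a dataset owned by a hypothetical `sales` project and marked `INTERNAL` in the ingestion zone would be readable by a `marketing` project with `INTERNAL` clearance, while the same data in the consumption zone would stay private to `sales`.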