A lot of new terminology has been cropping up of late: Data Engineer, Data Scientist, Data Pipelines, Big Data. Pick and choose your favourites, and list the ones I didn’t even mention. But what does it all mean? What is Big Data? What is a Data Engineer or a Data Scientist, when they don’t even Engineer or Science? And what is the difference between the two roles anyway? To answer all of those questions, we need to lay down a foundation first. So let’s start from the ground up and discuss what data is, and why it has become so important.
Google defines Data as “facts and statistics collected together for reference or analysis.”
Facts and statistics? That still feels like a rather broad term, so let’s make it much simpler than all of this technical terminology. Let’s say you go down to your local convenience store to buy a loaf of bread. What data can be generated by this interaction?
First and foremost, stock data. The convenience store needs to keep a steady supply of bread available, so how do they know when to order new stock? And how much stock do they need to order? They need to keep a journal of some sort recording the stock level of bread, and of the different kinds of bread as well. Which loaves sell faster than others?
From this interaction we now have facts. How much bread was sold today? How much bread is still left on the shelf? Which bread is still left on the shelf? By collecting these facts over a period of time we can start generating statistics. Which bread needs to be restocked next, and how often do we need to order? Which type of bread sells the most, and what should the expected sales forecast be?
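To make the step from facts to statistics a little more concrete, here is a minimal sketch in Python. The sales figures, the bread types and the `summarise` function are all made up for illustration; a real store would pull these facts from its point-of-sale or stock system.

```python
from collections import defaultdict

# Hypothetical facts collected over three days: (day, bread_type, loaves_sold)
sales = [
    ("Mon", "white", 12),
    ("Mon", "brown", 7),
    ("Tue", "white", 15),
    ("Tue", "brown", 5),
    ("Wed", "white", 9),
    ("Wed", "brown", 8),
]


def summarise(sales):
    """Turn individual sales facts into simple per-bread statistics."""
    totals = defaultdict(int)  # total loaves sold per bread type
    days = defaultdict(set)    # distinct days on which each type sold
    for day, bread, qty in sales:
        totals[bread] += qty
        days[bread].add(day)
    return {
        bread: {
            "total": totals[bread],
            "daily_average": totals[bread] / len(days[bread]),
        }
        for bread in totals
    }


summary = summarise(sales)

# The bread type with the highest total sales is the best seller.
best_seller = max(summary, key=lambda bread: summary[bread]["total"])

for bread, stats in summary.items():
    print(f"{bread}: {stats['total']} sold, {stats['daily_average']:.1f} per day")
print(f"Best seller: {best_seller}")
```

The daily average here is about the simplest forecast you could make. It is only a starting point, but it shows how a handful of recorded facts already begins to answer the restocking questions above.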
And that is just looking at the loaf of bread itself; we haven’t even started to consider the financials. What income is generated by these sales? How much budget should be allocated to restocking each item? The more we look at the scenario, the easier it is to find items to measure and group together for analysis.
Data is all around us on a constant basis. We are surrounded by information, if not drowning in it. And this is not only down to the technical hustle and bustle of mobile devices, television sets, or even the internet. Very simple interactions can generate multiple data sets that can be analysed to the nth degree.
Now that we have a fairly good idea of what data can be, what does it mean when people talk about Big Data? Well, it’s actually quite simple: Big Data is just a lot of data. But, like, massive amounts of data. Data that is not going to fit on all of your external hard drives put together. Data of this size requires specialised housing and, just as with any other commodity, we need a Warehouse to store it all in.
And that is where our Data Engineers get involved. As a Data Engineer you should have the skills and knowledge to understand how the data is generated, and thus how to capture and store it. From your client’s requirements you should be able to determine what kind of data pipeline will fit their needs and how big their warehouse will have to be, each of which comes with its own level of complexity. It is thus important that the Data Engineer involved has a thorough understanding of the various tools and methodologies at their disposal, to ensure that the data is not only captured and stored, but that this is done effectively, accurately and with cost in mind.
It is here that the Data Engineer and Data Scientist roles start to overlap. Once the data has been stored, it needs to be refined for reporting and analytical purposes. This refinement ensures that the data can be used for daily Business Intelligence reporting, as well as for the analytics that will be run by the Data Scientist. The models run by the Data Scientist are only as accurate as the data they receive and process, so it is imperative that this data is of the highest quality.
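As a rough illustration of what such refinement can look like, the sketch below checks a few records before they are passed on for reporting or modelling. The field names (`bread_type`, `loaves_sold`) and the rules in `validate` are assumptions made up for this example; real quality checks depend entirely on the data set and the client’s requirements.

```python
def validate(record):
    """Return a list of problems found in one hypothetical sales record."""
    problems = []
    if not record.get("bread_type"):
        problems.append("missing bread_type")
    qty = record.get("loaves_sold")
    if not isinstance(qty, int) or qty < 0:
        problems.append("invalid loaves_sold")
    return problems


# Made-up records: one clean, one with an empty bread type, one with a negative count.
records = [
    {"bread_type": "white", "loaves_sold": 12},
    {"bread_type": "", "loaves_sold": 4},
    {"bread_type": "brown", "loaves_sold": -3},
]

# Only records without problems move on to reporting and analytics.
clean = [r for r in records if not validate(r)]

# Rejected records are kept together with the reasons, so they can be fixed at the source.
rejected = [(r, validate(r)) for r in records if validate(r)]

print(f"{len(clean)} clean record(s), {len(rejected)} rejected")
```

Keeping the rejected records alongside the reasons, rather than silently dropping them, makes it easier to trace a quality problem back to where the data was captured.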
As you can see, even the simplest interaction generates data, and that data can become very powerful if used correctly. Imagine what could be done if the data being generated in your personal life and work life were to be used to the fullest extent. And this kind of data work is not limited to Information Technology or Financial Institutions; it also applies to areas such as Aviation, Medical Care and, as we recently spoke about at Africa Agri Tech, even Agriculture.