The origins of Big Data
Now that we know what big data is, the bigger question is: where did it come from, and how? In historical terms, the internet, data and data transfer are still very, very young. It has been only a little over five decades since we could process that kind of information at anything close to the speeds we have today.
So let’s travel a bit back in time to see how we got here. And make no mistake, the kind of data we’re dealing with now may have been only a pipe dream for scientists way back when.
Although the term itself has been attributed to a number of people, case studies, papers and even works of fiction, the most recurrent (and likely) belief is that ‘Big Data’ as a buzzword, a catchphrase and a term we all know and use today originated with American computer scientist John Mashey in the 1990s. Yes, that recent!
As I told you in our last post, big data usually means data sets of an unprecedented size. Today, I’ll clarify that a bit.
Big Data deals with far, far larger volumes of data than we knew in the past, and by past I even mean less than two decades ago. The very, very recent past. The catch here is that while there may have been large data sets before, they were organised into smaller data sets, or given a modicum of organisation by human scientists, so even those larger volumes of data had some sort of direction, or underlying order, to them.
The kind of big data we’re dealing with today does not. And that is where the term ‘big data’ is used: when the data volumes don’t have a system or order to them, and the goalposts for that data, and the conclusions to draw from it, are constantly moving.
The most popular theory for this is usually credited to research firm Gartner: the 3 Vs of data, set out back in 2001 by analyst Doug Laney, then at META Group, which Gartner later acquired [1]. Those three Vs?
Volume
Velocity
Variety
Now, I’ll get into each so you’re able to understand them better.
Volume
This may seem self-explanatory, but I’ll explain it to you anyway. Volume simply refers to the amount of data out there. Let’s look at an example that we use in our everyday lives, because its implications are massive. Many of us, if not most, use Facebook, and quite actively so in many cases.
As of the third quarter of 2017, Facebook has a mammoth 2.07 billion monthly active users. Each of those users, individually, generates a great amount of data: surfing habits, pages they access, demographics they belong to. Even the information they give Facebook when they sign up is part of the massive sets of data it handles. Each user is a tiny drop in the ocean, but every drop contributes to the volume of big data.
Facebook can then use that data for many things, on an individual and community basis. For example, it can predict — and target — ads that are relevant to an individual based on the pages or websites they access or the videos they view. Have you ever seen an ad for athletic wear or shoes after looking around at fitness pages on Facebook?
That’s just one small part of how that volume of big data can be processed and used.
They can also use that data on a large scale — say, to see which brands are popular among a certain demographic. For example, Indian users between the ages of 18–23 might gravitate towards using a certain set of cosmetics brands, or apparel brands. That also helps Facebook and those brands.
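To make the demographic example a little more concrete, here is a minimal sketch of that kind of large-scale count. Everything in it is made up for illustration: the `likes` records, the brand names and the `top_brand` helper are my own assumptions, not anything Facebook actually exposes or uses.

```python
from collections import Counter

# Hypothetical, hand-made records: one entry per "like" on a brand page.
likes = [
    {"age": 19, "country": "IN", "brand": "BrandA"},
    {"age": 22, "country": "IN", "brand": "BrandA"},
    {"age": 21, "country": "IN", "brand": "BrandB"},
    {"age": 35, "country": "IN", "brand": "BrandB"},
    {"age": 20, "country": "US", "brand": "BrandB"},
]


def top_brand(records, country, min_age, max_age):
    """Return the most-liked brand within one country and age band, or None."""
    counts = Counter(
        r["brand"]
        for r in records
        if r["country"] == country and min_age <= r["age"] <= max_age
    )
    return counts.most_common(1)[0][0] if counts else None


# Which brand is most popular among Indian users aged 18 to 23 in this toy data?
favourite = top_brand(likes, country="IN", min_age=18, max_age=23)
print(favourite)
```

With five records this is trivial. The point of big data methods is doing the same kind of count across billions of records that keep arriving.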
Essentially, every user is a walking, and very significant, set of data.
And if one person generates that much data (and more), imagine just how much data is generated by 2.07 billion users. That is more than a quarter of the entire world population!
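If you want to check that share for yourself, the arithmetic is a one-liner. The 2.07 billion figure is the one quoted above. The world population figure of roughly 7.5 billion for 2017 is my own rounded assumption, so treat the result as a ballpark.

```python
# Figure quoted in this post: Facebook users as of Q3 2017.
facebook_users = 2.07e9

# Assumption: world population in 2017, rounded to about 7.5 billion.
world_population = 7.5e9

share = facebook_users / world_population
print(f"Facebook users as a share of world population: {share:.1%}")
```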
Velocity
You’ve probably studied this term in a high-school physics class. Velocity translates to speed, and it is closely tied to the volume of data. As I told you earlier, there are a huge number of users on the internet, and each is generating a huge amount of data. That data is also being measured in a very short span of time: a few seconds or less. There is a need to be able to control and understand that data in real time.
And that’s where big data comes in.
With the speed at which new data comes in, making some old data obsolete as it does (and by old, I could even mean minutes old), there is a need to manage that data in real time. That data could arrive in small bits, in streams, or in large batches. Whatever the form, it is the need to handle and process it in real time that means big data, and the methods to deal with it, are employed.
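One simple way to picture data going obsolete within minutes is a sliding window: keep only the events from the last few seconds and forget the rest. The sketch below is just an illustration under my own assumptions. The `SlidingWindowCounter` class and the timestamps are made up, and real streaming systems are far more involved.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts only the events seen within the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()  # oldest event on the left, newest on the right

    def add(self, timestamp):
        """Record a new event, then drop any events that are now too old."""
        self.timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()

    def count(self):
        return len(self.timestamps)


# Hypothetical event times, in seconds since we started listening.
counter = SlidingWindowCounter(window_seconds=60)
for t in [0, 10, 30, 65, 70]:
    counter.add(t)

recent = counter.count()
print(recent)  # events still inside the 60-second window at t=70
```

By the time the event at second 70 arrives, the events from seconds 0 and 10 have already aged out, which is the "old data becomes obsolete" idea in miniature.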
Variety
Here, I’ll go back to last week’s article (if you haven’t read it, please do!). I talked about various types of data analysis, and how machine learning and artificial intelligence can be used to cope with, process and ultimately make sense of that data. This section dials it back a step to look at what kinds of data there are.
And there are many, many types. Data could be numerical, or it could be text-based. To go back to the social media example: the data could be in the form of shares or, deeper in, the metrics of those shares. That means it could be visual (images), textual, or expressed in percentages and numbers for Facebook, or for the companies advertising or working on it.
Data can be numbers in cells, like so many of us use on a daily basis. It could be plain text, typed into your notepad or written down by hand. There are no restrictions on what constitutes data, and all of it can be parsed together. It is the how that is important.
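As a rough sketch of that "how", here is one way very different inputs could be squeezed into a single common shape before analysis. The `normalise` function, the three source labels and the sample records are all hypothetical, chosen only to show the idea of parsing varied data together.

```python
import csv
import io
import json


def normalise(raw, source):
    """Turn one raw record into a dict with the same keys, whatever its origin."""
    if source == "csv":
        # A spreadsheet-style row: "user,number"
        row = next(csv.reader(io.StringIO(raw)))
        return {"user": row[0], "kind": "number", "value": float(row[1])}
    if source == "json":
        # A structured post, e.g. exported from a social network
        obj = json.loads(raw)
        return {"user": obj["user"], "kind": "text", "value": obj["text"]}
    # Anything else is treated as free-form plain text with no known author
    return {"user": None, "kind": "text", "value": raw.strip()}


records = [
    normalise("alice,42", "csv"),
    normalise('{"user": "bob", "text": "Great shoes!"}', "json"),
    normalise("  handwritten note, typed up later \n", "note"),
]

for record in records:
    print(record)
```

Once everything shares the same keys, numbers and text from very different places can sit in one list and be processed with the same tools.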
Newer data also brings variability with it. Say I’m on Twitter, and I’m looking to analyse a trend. Trends are fickle and, if not paid for (that’s another beast in data altogether!), can and will change at a moment’s notice, so the data describing them keeps shifting too. The data around a trend doesn’t necessarily have any structure either, which poses a unique problem in how to parse it. That is where big data methods come in: the same methods you can refresh on from last week, the first of which I’ll go into in detail next week.
That’s all for this week, but be sure to read on as I explain just how businesses, and people, tackle this data, and the many ways there are to do it.