I. Introduction to big data

As we have seen throughout the course, data plays a critical role in our society and helps us understand the world around us. Over the past decades, the explosion of the internet and Web 2.0 services, together with mobile devices and sensors, has led to the creation of massive data sets.

The combination of a “growing torrent” of data and the availability of on-demand computing technologies (like cloud computing) has led to the development of the big data concept, which refers to data that exceeds the processing capacity of conventional database systems.

Big data definitions

Big data is usually defined as "large amounts of data produced very quickly by a high number of diverse sources".

Big data definitions are subjective in terms of how large a dataset should be to be considered big data. There is no reference to the number of bytes, which is how we usually measure data (for example gigabytes). With technology advancing fast, and more and more devices connecting to the internet, the amount of data being created is also increasing.

The size of the datasets that qualify as big data might also increase over time. Also, what is “big” for an organisation, a sector, or a country may be small for another – think of Apple compared to a small business, or Portugal compared to China.

Example

We create enormous trails of data

In 2020, we faced one of the biggest and most global challenges ever experienced. We were already “connected”, but suddenly every aspect of our lives, from exercising to work and study, moved online. Shops, gyms, offices, restaurants and cinemas were closed. The only way to work (for those who were not on the frontlines), study, communicate, buy furniture, socialise or watch a movie was through the internet. We couldn’t even visit and hug our families.

This situation has made the world even more digitised. In a given day, any of us could:

  • Communicate using WhatsApp messages

  • Browse or search for something online

  • Buy groceries, services or equipment online

  • Share a cute photo of our furry friend or a work document

  • Watch a series on Netflix or Amazon Prime Video before going to bed

  • Listen to music from SoundCloud, Spotify or YouTube

  • Buy and read a book on an e-reader

Multiply that by the millions of users who use their phones or computers (or both!) every day.

Data icons growing from footprints on the ground

Your digital footprint

Almost every action we take today leaves a digital trail. We generate data whenever we carry our sensor-equipped smartphones, when we search for something online, when we communicate with our family or friends using social media or chat applications, and when we shop. We leave digital footprints with every digital action, sometimes without being aware of it and sometimes involuntarily.

Have you wondered how companies like Amazon, Spotify or Netflix know what “you might also like”? Recommendation engines are a common application of big data. These companies use algorithms based on big data to make specific recommendations based on your preferences and historical behaviour. Siri and Alexa rely on big data to answer the variety of questions users may ask. Google Now is able to make recommendations based on big data on a user's device. But how do those recommendations influence how you spend your time, what products you buy, what opinions you read? Why do these big companies invest so much money in them? Do they only know you, or do they also influence you? Although recommendation systems account for up to a third of all traffic on many popular sites, we don’t fully know how much power they have to influence our decisions.
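
To make the idea of recommending from historical behaviour more concrete, here is a minimal sketch of one simple approach: suggesting titles that were watched by users whose history overlaps with yours. It is not how Amazon, Netflix or Spotify actually work; the user names, titles and the `recommend` function are invented for illustration.

```python
from collections import Counter

# Hypothetical viewing histories: user -> set of titles watched.
histories = {
    "ana": {"Drama A", "Thriller B", "Comedy C"},
    "ben": {"Drama A", "Thriller B"},
    "cai": {"Thriller B", "Comedy C", "Documentary D"},
    "dee": {"Drama A", "Documentary D"},
}

def recommend(user, histories, top_n=2):
    """Suggest unseen titles, weighted by how much each other user's history overlaps."""
    seen = histories[user]
    scores = Counter()
    for other, titles in histories.items():
        if other == user:
            continue
        overlap = len(seen & titles)  # number of titles both users have watched
        for title in titles - seen:
            scores[title] += overlap
    # Highest score first; ties are broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, score in ranked[:top_n] if score > 0]

suggestions = recommend("ben", histories)
```

Even in this toy version, the suggestions depend entirely on what other people did, which is one reason the question of influence raised above matters.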

Example

What does your phone know about you?

Have you ever wondered what your smartphone knows about you, about your behaviour, your feelings, your mood or your health? Smartphones have many powerful sensors that continuously generate data about you while making your life easier. Where is the line between privacy and data protection on the one hand and convenience on the other? That’s for you to consider and decide.

Big data combines structured, semi-structured and unstructured data that can be mined for information and used in machine learning, predictive analytics, and other advanced analytics applications. Structured data is data that can be arranged into rows and columns, as in relational databases. Semi-structured data has no fixed table layout but still carries labels or keys that describe its content, as in JSON or XML documents. Unstructured data is data that is not organised in a pre-defined way, for instance Tweets, blog posts, pictures and even video data.
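
The difference between these kinds of data is easier to see side by side. The following sketch uses invented sample values (the CSV rows, the JSON records and the text post are all assumptions) to show how each kind is typically read in Python.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (like a database table).
structured_csv = "user_id,country,purchases\n1,PT,3\n2,CN,7\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))
total_purchases = sum(int(row["purchases"]) for row in rows)

# Semi-structured: self-describing keys, but records need not share the same fields.
semi_structured = json.loads(
    '[{"user": 1, "tags": ["pets", "photo"]}, {"user": 2, "device": {"os": "android"}}]'
)
fields_per_record = [sorted(record.keys()) for record in semi_structured]

# Unstructured: free text with no predefined fields; any structure must be extracted.
unstructured = "Just shared a cute photo of my dog! #pets"
hashtags = [word for word in unstructured.split() if word.startswith("#")]
```

The structured rows can be summed directly, the semi-structured records have to be inspected one by one because their fields differ, and the unstructured text needs an extraction step before anything can be counted.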

Organisations use specific systems to store and process big data. Together, these systems form what is called a data management architecture.

Characteristics of big data

The most widely accepted characterisation of big data follows the three Vs coined by Doug Laney in 2001: the large volume of data being generated, the broad variety of data types stored and processed in big data systems, and the velocity at which the data is generated, collected and processed. Veracity, value and variability have since been added to enrich the description of big data.

IBM Big Data & Analytics Hub created an infographic which explains and gives examples of each of the first four Vs.

  • Volume means the amount of data being generated and collected every moment in our highly digitised world, measured in bytes (terabytes, exabytes, zettabytes). As you can imagine, enormous volumes of data create many challenges for storage, distribution and processing, which translate into issues of cost, scalability and performance. The volume is also driven by the increase in data sources (more people online), higher resolutions (sensors) and scalable infrastructure.

Note

Every day, 2.5 quintillion bytes (2.5 exabytes) of data are created. That is enough to fill about 100 million 25 GB Blu-ray Discs every day. On Instagram, 95 million photos and videos are shared every day; 306.4 billion emails are sent and 5 million Tweets are posted. There are 4.57 billion active internet users around the world. All our devices generate, gather and store data.

  • Velocity refers to the speed at which data is generated non-stop, streamed in real time or near real time, and processed using local and cloud-based technologies.

Note

Every second, one hour of video is uploaded to YouTube.

  • Variety is the diversity of data. Data is made available in different forms such as text, images, tweets or geospatial data. Data also comes from different sources, such as machines, people and organisational processes (both internal and external). Drivers are mobile technologies, social media, wearable technologies, geotechnologies, video and many, many more. Attributes include the degree of structure and complexity.

  • Veracity refers to conformity to facts and accuracy. It also covers the quality and origin of data. Attributes include consistency, completeness, integrity and ambiguity. Drivers include cost and the need for traceability. With the high volume, velocity and variety of data created, we need to ask: is the information real, or is it false?

There are more emerging Vs, but we will mention just one more: value. It refers to our capacity and need to turn data into value. Value doesn’t mean just profit. It may be related to security and safety (like seismic information), medicine (wearables that can identify heart attack signs) or social benefits like employee or personal satisfaction. Big data has a large intrinsic value that can take many shapes.

The Vs not only characterise big data, they also embody its challenges: enormous amounts of data, available in different formats, largely unstructured, with varying quality, that require fast processing in order to take well-timed decisions.
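
To get a feel for the volume and velocity figures quoted in this section, it helps to convert them into familiar units. The sketch below does this for 2.5 quintillion bytes per day and for one hour of video uploaded per second. It assumes decimal (SI) units and nominal Blu-ray capacities of 25 GB (single layer) and 50 GB (dual layer); these capacities and the helper name `discs_needed` are assumptions for illustration.

```python
# Decimal (SI) byte units.
GIGABYTE = 10**9
EXABYTE = 10**18

# Volume: 2.5 quintillion bytes per day is 2.5 exabytes per day.
daily_bytes = 25 * EXABYTE // 10

def discs_needed(total_bytes, disc_capacity_gb):
    """Number of full discs of the given capacity that total_bytes would fill."""
    return total_bytes // (disc_capacity_gb * GIGABYTE)

single_layer_discs = discs_needed(daily_bytes, 25)  # assumed 25 GB per disc
dual_layer_discs = discs_needed(daily_bytes, 50)    # assumed 50 GB per disc

# Velocity: one hour of video uploaded every second, accumulated over a day.
SECONDS_PER_DAY = 24 * 60 * 60
video_hours_per_day = 1 * SECONDS_PER_DAY
video_years_per_day = video_hours_per_day / (24 * 365)
```

Under these assumptions, one day of data fills 100 million single-layer or 50 million dual-layer discs, and one day of uploads amounts to 86,400 hours of video, which is close to ten years of continuous viewing.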

Why and how is big data analysed?

80% of data is considered to be unstructured. How do we get reliable and accurate insights? The data must be filtered, categorised, analysed and visualised.

Big data analytics is the technological process of examining big data (high-volume, high-velocity and/or high-variety data sets) to uncover information such as hidden patterns, correlations, market trends and customer preferences. These insights help organisations, governments and institutions make informed, smarter and faster decisions.

This addresses three important questions: what, why and how. We’ve already seen the what, so we will now get an overview of the why and how.

The why and how of big data

Big data follows the principle that “the more you know about something, the more reliably you can gain new insights and make predictions about what will happen in the future”.

A typical data management lifecycle includes ingestion, storage, processing, analytics, visualisation, sharing and applications. The cloud and big data go hand in hand, with data analytics often running on public cloud services. Companies like Amazon, Microsoft and Google offer cloud services that enable the fast deployment of massive amounts of computing power, so companies can access state-of-the-art computing on demand, without owning the necessary infrastructure, and run the entire data management lifecycle in the cloud. In the previous section we spoke about SaaS, IaaS and PaaS. Cloud computing offers big data researchers the opportunity to access anything as a service (XaaS).
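
The first stages of this lifecycle can be pictured as a chain of small steps, each feeding the next. The sketch below is a toy, in-memory version that runs on a laptop rather than in the cloud; the function names, the event format and the sample data are all assumptions made for illustration, and the sharing and applications stages are left out.

```python
import json
from collections import Counter

def ingest(lines):
    """Ingestion: parse raw JSON event lines, skipping malformed ones."""
    events = []
    for line in lines:
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return events

def store(events, storage):
    """Storage: append events to a list standing in for a real data store."""
    storage.extend(events)
    return storage

def process(storage):
    """Processing: keep only events that have the fields the analysis needs."""
    return [e for e in storage if "user" in e and "action" in e]

def analyse(events):
    """Analytics: count how often each action occurs."""
    return Counter(e["action"] for e in events)

def visualise(counts):
    """Visualisation: render a simple text bar chart, one row per action."""
    return "\n".join(f"{action:<8}{'#' * n}" for action, n in sorted(counts.items()))

raw_lines = [
    '{"user": 1, "action": "search"}',
    '{"user": 2, "action": "buy"}',
    'not valid json',
    '{"user": 1, "action": "search"}',
    '{"user": 3}',
]
counts = analyse(process(store(ingest(raw_lines), [])))
chart = visualise(counts)
```

In a real deployment each function would be replaced by a managed service or a distributed system, but the order of the stages stays the same.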

Pre-processing

Raw data may contain errors or low-quality values (missing values, outliers, noise, inconsistent values) and might need to be pre-processed (data cleaning, fusion, transformation and reduction) to remove noise, correct errors or reduce its size. For example, in water usage behaviour analysis, smart water meter data must be pre-processed before it can reveal useful consumption patterns, because IoT sensors may fail to record some readings.
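
As a minimal sketch of data cleaning, the example below takes a short series of water meter readings with gaps and one implausible spike, and produces a cleaned series. The readings, the outlier threshold (five times the median) and the choice to fill gaps with the median are all assumptions; real pipelines choose these rules based on the sensor and the analysis.

```python
from statistics import median

# Hypothetical hourly readings in litres; None marks an hour the sensor failed to record.
raw_readings = [12.0, 11.5, None, 12.5, 950.0, 13.0, None, 12.0]

def clean_readings(readings, outlier_factor=5.0):
    """Replace missing values and implausible spikes with the median of the valid values."""
    observed = [r for r in readings if r is not None]
    typical = median(observed)
    limit = outlier_factor * typical  # assumed threshold for an implausible spike
    valid = [r for r in observed if r <= limit]
    fill_value = median(valid)
    return [r if r is not None and r <= limit else fill_value for r in readings]

cleaned = clean_readings(raw_readings)
```

Here the two missing hours and the 950-litre spike are each replaced by 12.0, the median of the remaining readings, so the series keeps its length and can be analysed hour by hour.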

Graphs with data patterns

Identifying patterns or insights

The automated process behind big data involves building models based on the collected data and running simulations, modifying the value of data points to observe how this affects the results. The advanced analytics technology available today can run millions of simulations, tweaking variables in a quest to identify patterns or insights (finding correlations between variables) that might provide a competitive advantage or solve a problem. Behavioural analytics focuses on the actions of people, and predictive analytics looks for patterns that can help in anticipating trends.
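
Finding a correlation between two variables can be illustrated with the Pearson correlation coefficient, a standard measure that ranges from -1 (perfectly opposite) to 1 (perfectly aligned). The sketch below computes it by hand; the data (daily advertising spend and daily orders for a small shop) is invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: advertising spend and number of orders on five days.
ad_spend = [10, 20, 30, 40, 50]
orders = [12, 25, 31, 45, 52]

r = pearson(ad_spend, orders)  # close to 1: the two variables rise together
```

A value close to 1 suggests the two variables move together, but a correlation on its own does not show that one causes the other.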

Example

As an example, let’s look at business intelligence (BI). BI is the process of analysing data with the objective of delivering actionable information that helps executives, managers and workers make informed business decisions. Business intelligence focuses on business operations and performance. The data required for BI is different: it is more refined. Big data systems hold raw data that needs to be filtered and curated before being loaded and analysed for BI purposes. The tools used are also different, since both the objective and the data are different.

Data mining

The process of discovering patterns from large datasets involving statistical analysis is called data mining. Statistical analysis is a common mathematical method of information extraction and discovery. Statistical methods are mathematical formulas, models and techniques used to find patterns and rules from raw data. Commonly used methods are regression analysis, spatiotemporal analysis, association rules, classification, clustering and deep learning.
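
Of the methods listed, association rules are among the easiest to show in a few lines. The sketch below computes two standard measures for the rule "bread implies butter" on a handful of invented shopping baskets: support (how often the items appear together) and confidence (how often butter appears when bread does). The basket contents are assumptions for illustration.

```python
# Hypothetical shopping baskets, one set of items per purchase.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated share of transactions with the antecedent that also hold the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)
```

In this toy data set, bread and butter appear together in three of the five baskets, and three of the four baskets with bread also contain butter. Data mining tools run the same kind of calculation over millions of transactions and many candidate rules.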

Example

An example of the practical use of big data is seen in mobile phone data. Usage data from phone sensors can be used for usage-based insurance (UBI). Sparkbit offers a customised insurance offer to drivers based on their behaviour. Their system uses the information from smartphones to evaluate driving technique and behaviour. By March 2018, they had accumulated 330 million kilometres of historical routes driven by their system users. They have 30,000 new active users a month, each registering an average of 70 new routes. A sequence of GPS points (geographical coordinates, estimated position accuracy, vehicle speed and the direction in which the vehicle moves) is created for every drive. The system stores the data, processes it, analyses driver behaviour (such as dangerous driving), and issues a point score for the route and the driver.
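
To illustrate how a sequence of GPS points could be turned into a route score, here is a deliberately simple sketch. It is not Sparkbit's method: the `GpsPoint` fields, the 50 km/h speed limit, the harsh-braking threshold and the 10-point penalty are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class GpsPoint:
    t: float          # seconds since the start of the route
    lat: float
    lon: float
    speed_kmh: float

def score_route(points, speed_limit_kmh=50.0, harsh_brake_kmh_per_s=10.0):
    """Return a 0-100 score; each speeding sample or harsh braking event costs 10 points."""
    penalties = 0
    # Compare each point with the one before it; the first point is only a baseline.
    for prev, cur in zip(points, points[1:]):
        dt = cur.t - prev.t
        if dt <= 0:
            continue  # skip out-of-order or duplicate timestamps
        deceleration = (prev.speed_kmh - cur.speed_kmh) / dt
        if deceleration >= harsh_brake_kmh_per_s:
            penalties += 1
        if cur.speed_kmh > speed_limit_kmh:
            penalties += 1
    return max(0, 100 - 10 * penalties)

route = [
    GpsPoint(0, 38.7223, -9.1393, 40),
    GpsPoint(1, 38.7224, -9.1394, 48),
    GpsPoint(2, 38.7225, -9.1395, 62),  # above the assumed 50 km/h limit
    GpsPoint(3, 38.7226, -9.1396, 45),  # drop of 17 km/h within one second
    GpsPoint(4, 38.7227, -9.1397, 44),
]
route_score = score_route(route)
```

This route contains one speeding sample and one harsh braking event, so it scores 80 out of 100. A production system would also use position accuracy and direction, and would calibrate its thresholds against real claims data.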

To make sense of the available data, cutting-edge analytics involving artificial intelligence and machine learning are commonly used. With machine learning, computers can learn to identify what various data inputs or combinations of data inputs represent, identifying patterns much faster and more efficiently than humans.
