To define Big Data, you need to take a close look at its characteristics. Its definition was earlier mostly limited to the 3V model, and while that model remains crucial and correct, two more essential factors have since been added, giving the 5 Vs of Big Data described below.
Volume is the total amount of data that is generated, and this enormous volume is the defining characteristic of Big Data. A single text file, audio clip, video, or image can range from kilobytes to gigabytes. The volume is so high that it cannot be analyzed using conventional data processing techniques; it requires dedicated approaches to data mining, data storage, data analysis, data sharing, and data visualization. Such large data comes from various sources, like social media, business processes, machines, networks, and human interaction, and is stored in data warehouses. There are many options for data warehousing, and the most popular choice is cloud services.
Example: Amazon handles 15 million customer click-stream records per day to recommend products to its users.
Velocity measures the rate at which data is generated and modified, and the speed at which it is processed: in other words, how fast the data is coming in. Billions of people upload millions of pictures to Facebook every day, and the number seems to increase with each passing day. Facebook has to handle this flood of photographs: it consumes the data first, processes it, files it, and makes it available for the user to retrieve later.
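The consume/process/file/retrieve cycle above can be sketched as a minimal in-memory pipeline. This is a hypothetical illustration, not any platform's actual architecture; the event IDs, payloads, and function names are all invented for the example.

```python
from collections import deque
from datetime import datetime, timezone

# Hypothetical sketch of an ingest-process-store pipeline: events are
# consumed first, processed (tagged with an ingest timestamp), filed in a
# store, and retrieved later on demand.
queue = deque()
store = {}

def ingest(event_id, payload):
    """Consume a raw event as soon as it arrives."""
    queue.append((event_id, payload))

def process_all():
    """Process queued events and file them for later retrieval."""
    while queue:
        event_id, payload = queue.popleft()
        store[event_id] = {
            "payload": payload,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        }

def retrieve(event_id):
    """Retrieve a previously filed event, or None if unknown."""
    return store.get(event_id)

ingest("photo-1", {"user": "alice", "bytes": 204800})
ingest("photo-2", {"user": "bob", "bytes": 512000})
process_all()
print(retrieve("photo-1")["payload"]["user"])  # alice
```

In a real high-velocity system the in-memory queue would be replaced by a distributed log, but the shape of the cycle is the same.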
Variety refers to the range of formats and sources data arrives in. Data in earlier times used to be delivered in a single format, collected from a single source, typically as structured database files such as Excel, CSV, and Access. It is now also presented in non-traditional forms such as audio, video, and images. This heterogeneity can make building a data warehouse troublesome. Big Data, coming from a wide variety of sources, is generally a mix of structured, semi-structured, and unstructured data, and this variety demands distinct processing capabilities and specialized algorithms.
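As a small illustration of handling variety, the sketch below normalizes records from a structured source (CSV) and a semi-structured one (JSON) into one common list of dicts, so downstream code no longer cares which source each record came from. The field names and sample data are invented for the example.

```python
import csv
import io
import json

# Structured source: CSV (all values arrive as strings).
csv_data = "id,name\n1,alice\n2,bob\n"
# Semi-structured source: JSON.
json_data = '[{"id": 3, "name": "carol"}]'

records = []
records.extend(csv.DictReader(io.StringIO(csv_data)))  # structured
records.extend(json.loads(json_data))                  # semi-structured

# Unify field types across sources (CSV gives strings, JSON gives ints).
for r in records:
    r["id"] = int(r["id"])

print([r["name"] for r in records])  # ['alice', 'bob', 'carol']
```

Unstructured data (audio, images) needs heavier, format-specific processing before it can be folded into a common representation like this.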
Veracity refers to the quality, accuracy, and reliability of the data to be analyzed. High-veracity data contains records that are valuable and contribute meaningfully to results, whereas low-veracity data is messy and meaningless. When processing Big Data sets, it is hence important to check the validity of the data before proceeding to process it. This is a difficult task indeed, considering biases, missing data lineage (information identifying where the data has been stored across various databases), bugs that cause data to be calculated incorrectly, abnormalities, untrusted or robotic sources, fake news sources, uncertain statements, expired data, human error, etc.
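A validity check of this kind can be sketched as a simple filter that drops records failing basic veracity rules before any expensive processing runs. The rules, field names, trusted-source list, and cutoff date here are all illustrative assumptions, not a standard.

```python
from datetime import date

# Hypothetical trust list and expiry cutoff for the example.
TRUSTED_SOURCES = {"sensor", "crm"}
EXPIRY_CUTOFF = date(2024, 1, 1)

def is_valid(record):
    """Return True only if the record passes basic veracity checks."""
    if record.get("value") is None:                  # missing data
        return False
    if record.get("source") not in TRUSTED_SOURCES:  # untrusted/robotic source
        return False
    expires = record.get("expires")
    if expires is not None and expires < EXPIRY_CUTOFF:
        return False                                 # expired data
    return True

raw = [
    {"value": 10, "source": "sensor", "expires": date(2025, 6, 1)},
    {"value": None, "source": "sensor"},                          # missing value
    {"value": 7, "source": "bot-farm"},                           # untrusted
    {"value": 3, "source": "crm", "expires": date(2020, 1, 1)},   # expired
]
clean = [r for r in raw if is_valid(r)]
print(len(clean))  # 1
```

Real pipelines add many more checks (schema validation, lineage tracking, outlier detection), but the pattern of filtering before processing is the same.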
Value: data on its own is often inconsequential. Analysts use continuous processes and techniques to measure the value of data, whether structured or unstructured. The real value of data lies in its potential to improve decision-making, and it is unlocked by getting the other four Vs of Big Data right: Volume, Velocity, Variety, and Veracity. It does not matter how voluminous the data is; if an organization cannot use that massive amount of data intelligently, it is of no use. Having tons of data but not using it seldom delivers the results an organization is looking for.