However, big data is, to be honest, much more than social media generated content that can be processed and analyzed. Big data is generally described in terms of the four V's. Oracle has identified the four V's as Volume, Velocity, Variety and Value, while IBM gives the four V's a somewhat different meaning, using Veracity rather than Value as the fourth V.
Other sources of big data can very well be the output of sensors in your factory, sensors in a city, or a large number of other sources from within or outside of your company. Big data can be the output of devices from the Internet of Things, as described in this blogpost.
Whatever the source of your "big" data, the common factors are in most cases that there is a large volume of it, it is coming in a variety of formats and it is coming at you fast and continuously. When working with big data the following 3 steps are common (a minimal code sketch of these steps follows the list below):
- Acquire
There is the need to acquire the data from a number of sources. Within the acquire process there is also the need to store the data. Acquiring is not necessarily the same as capturing the data; in the example of sensor data the sensor will capture the data and send it to the acquire process.
- Process
When data is acquired and stored it will need to be processed. In some (most) cases this means organizing the data so it can be used in analysis; however, it can very well also be processed for purposes other than analysis.
- Use
Using the processed data in many cases means analyzing or further analyzing the data that comes out of the process step. However, it can also be input for other, non-analytic, business processes.
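To make these three steps a bit more concrete, below is a minimal Python sketch of the acquire, process and use steps for a handful of sensor messages. The message format, the function names and the in-memory "store" are purely illustrative assumptions for this example, not part of any specific product.

```python
import json
from collections import defaultdict

# Hypothetical raw messages as they might arrive from sensors; in a real
# setup these would come in via a listener or message queue.
RAW_MESSAGES = [
    '{"sensor": "factory-01", "temperature": 21.5}',
    '{"sensor": "factory-01", "temperature": 22.1}',
    '{"sensor": "city-07", "temperature": 18.9}',
]

def acquire(messages):
    """Acquire step: receive the raw data and store it (here: an in-memory list)."""
    store = []
    for message in messages:
        store.append(message)
    return store

def process(store):
    """Process step: organize the raw data into a structure usable for analysis."""
    organized = defaultdict(list)
    for message in store:
        record = json.loads(message)
        organized[record["sensor"]].append(record["temperature"])
    return organized

def use(organized):
    """Use step: a trivial analysis, the average temperature per sensor."""
    return {sensor: sum(values) / len(values) for sensor, values in organized.items()}

if __name__ == "__main__":
    print(use(process(acquire(RAW_MESSAGES))))
```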
In the image below you can see Oracle's vision on those steps, in which they acquire, organize and analyze data.
Even though the above gives a good example, a step is missing from the visual representation. As it is shown now, the acquire phase is displayed as data stored in HDFS, NoSQL or a transactional database. One of the big parts a good big data strategy should cover is a step before the data is stored, and that is receiving the data. One of the areas where a lot of time will have to be dedicated is creating a good technological capture strategy. You will have to have "listeners" to which data producers can talk and which will write the data to the storage.
If you take, for example, the Internet of Things, you will have a lot of devices sending out data, and you will have to have receivers to which those devices talk. As we have stated that big data is characterized by volume, variety and velocity, this means you will have to build a number of receivers, and those receivers should be able to handle a lot of messages and data. This means that your receivers should be able to, for example, balance the load and work in parallel to cope with the load of all devices that send data to them. You will also have to ensure that the write speed at which the receivers can store the data is in line with the supply of data that is sent to them.
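As a rough illustration of what such receivers could look like, the Python sketch below runs several receiver workers in parallel that pull device messages from a shared queue and write them to a stand-in store. The queue, the worker count and the message format are assumptions for the example; in a real deployment the transport would typically be a message broker or a load balancer in front of multiple receiver processes.

```python
import queue
import threading

# In-memory queue standing in for the transport between data producers
# (devices) and the receivers.
incoming = queue.Queue()

storage = []                 # stand-in for HDFS / NoSQL / a database
lock = threading.Lock()

def receiver(worker_id, storage, lock):
    """A single receiver worker: reads messages and writes them to storage."""
    while True:
        message = incoming.get()
        if message is None:          # sentinel to shut the worker down
            incoming.task_done()
            break
        with lock:                   # serialize writes to the shared store
            storage.append((worker_id, message))
        incoming.task_done()

# Run several receivers in parallel so the combined write capacity can
# keep up with the rate at which the devices supply data.
workers = [threading.Thread(target=receiver, args=(i, storage, lock)) for i in range(4)]
for w in workers:
    w.start()

# Simulate a burst of device messages.
for n in range(1000):
    incoming.put(f'{{"device": "sensor-{n % 10}", "value": {n}}}')

# Signal shutdown and wait for all messages to be written.
for _ in workers:
    incoming.put(None)
for w in workers:
    w.join()

print(f"stored {len(storage)} messages")
```

The parallel workers are exactly the point made above: the combined write capacity of the receivers has to stay in line with the rate at which the devices send data.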
An example of where such topics were an issue and were handled correctly is the CMS DAQ system developed by the TriDAS project team at CERN when developing data capture triggers for the Large Hadron Collider project. In the video below Tulika Bose, an assistant professor at Boston University, gives a short introduction to this.