The first week of this course focused not just on what big data is, but on some common tools to measure it and how they work, in particular SQL, Hadoop and MapReduce.
From the course notes:
- SQL is very popular for storing data, but targets structured data by design.
- Hadoop can deal with unstructured data, such as text, by providing a more general paradigm than SQL.
Hadoop can use multiple sources for its data.
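The contrast in the notes above can be sketched in a few lines. This is just a toy illustration (the table and column names are made up): SQL requires you to declare a schema up front, and every row must fit it, whereas free-form text has no natural schema.

```python
import sqlite3

# A minimal sketch of SQL's schema-first design: every row must fit the
# columns declared up front (table and column names are invented here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'alice', 9.99)")
conn.execute("INSERT INTO orders VALUES (2, 'bob', 4.50)")

# Structured queries work well on data shaped like this...
total = conn.execute("SELECT SUM(total) FROM orders").fetchone()[0]
print(total)

# ...but free-form text (log lines, emails, web pages) has no obvious
# schema, which is the kind of data Hadoop was designed to handle.
```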
MapReduce, unlike SQL, lets you specify the steps taken to process the data, which gives you more flexibility in your approach to large-scale data sets.
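To make that concrete, here is a minimal sketch of the MapReduce idea in plain Python (no Hadoop cluster needed). The classic example is word counting: you write a map step and a reduce step, and the framework handles the grouping ("shuffle") in between. The input lines here are made up for illustration.

```python
from collections import defaultdict

def map_step(line):
    # Map: emit a (word, 1) pair for every word in a line of text.
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key; Hadoop does this between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # Reduce: combine all the counts emitted for one word.
    return (key, sum(values))

lines = ["big data is big", "data tools measure data"]
mapped = (pair for line in lines for pair in map_step(line))
counts = dict(reduce_step(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

The point is that you, not the query planner, decide what each step does, which is why the approach extends naturally to unstructured inputs.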
It was also interesting to be given examples in the course notes of where Hadoop is used – Amazon being one.
This was also a good article; it helped me make more sense of what the system is capable of and why it is used.
> The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables.
The next bit talked about the Hadoop Distributed File System (HDFS). When large-scale e-commerce and similar companies are using Hadoop, the data will naturally be spread over many servers, so, to quote the course material:
> Hadoop breaks incoming files into blocks and stores them redundantly across the cluster.
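That one sentence can be sketched as a toy model. The block size and node names below are invented, and the round-robin placement is a deliberate simplification (real HDFS defaults to roughly 128 MB blocks, three replicas, and rack-aware placement), but it shows the two ideas: split the file into blocks, and store each block on several machines so the cluster survives a node failure.

```python
import itertools

BLOCK_SIZE = 8    # bytes; tiny for illustration (HDFS uses ~128 MB)
REPLICATION = 3   # copies kept of each block (the HDFS default)
nodes = ["node1", "node2", "node3", "node4"]  # made-up node names

def split_into_blocks(data, size):
    # Cut the incoming file into fixed-size blocks.
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes, replication):
    # Toy placement: each block is assigned to `replication` nodes in
    # round-robin order. Real HDFS placement is rack-aware.
    placement = {}
    ring = itertools.cycle(nodes)
    for i, _ in enumerate(blocks):
        placement[i] = [next(ring) for _ in range(replication)]
    return placement

blocks = split_into_blocks(b"an incoming file stored redundantly", BLOCK_SIZE)
placement = place_blocks(blocks, nodes, REPLICATION)
print(len(blocks), placement[0])
```

Losing any single node still leaves two copies of every block it held, which is what "stores them redundantly" buys you.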
I copied and annotated the diagram used into the back of my diary as I was killing time at a local library and it was the only paper I had, so I figured I’d save it here too!