Ahmad Fattahi

PI System and Hadoop? Anyone?

Blog Post created by Ahmad Fattahi on Mar 14, 2012

[If you are already familiar with Hadoop, you can skip to the last two paragraphs.] There is a strong trend in the Big Data community around Hadoop and its various layers and implementations. Google introduced the MapReduce software framework along with GFS as the accompanying distributed file system. Then came the open source version, expanded by Yahoo and others, which delivers MapReduce and HDFS as the software and file system frameworks respectively. Nowadays many Big Data organizations have their own Hadoop clusters capable of handling many terabytes or petabytes of data with a good level of fault tolerance.


The main advantages of Hadoop are its cheap implementation, its tolerance for unstructured data, and its fault tolerance. The justification is that Hadoop lets us provide enough redundancy to stay available in the face of normal hardware and software failures. That redundancy is possible, of course, because the hardware is cheap. Also, we don't need to process all incoming data on the spot, so Hadoop can store unstructured data for future processing. The downside is that you sometimes get incoherence among data nodes. Also, for smaller amounts of data, traditional database systems or Parallel Data Warehouses (PDW) can get the job done with far fewer nodes. To show the trade-off, consider this example: eBay handles roughly a third to a half as much data as Facebook. eBay runs on PDW while Facebook runs Hadoop. Guess how much bigger Facebook's cluster is? About 10 times! You also sometimes see incoherence in Facebook updates, which is a direct result of multiple copies and cheaper hardware. All in all, both paradigms have merits, and they can co-exist and serve different purposes. In fact, Microsoft adopted Hadoop in 2011. For the same reason, some organizations (including Microsoft) embrace Sqoop (SQL to Hadoop) to bridge the gap.


Now the question is what you think of the future of Hadoop and the PI System. Do you see any feasible integration or use there? For one thing, we cannot break archive files at arbitrary places into 64 MB chunks, as HDFS does with its default block size. What if we create archive shards of 64 MB each and replicate whole copies of each shard on Hadoop data nodes? What operations do you think would fit well in the MapReduce paradigm? Many of the more advanced analytics can very well be broken down (MapReduced) into parallel operations. How about data collection in the first place?


Do you see any downside to that? Do you see any major obstacle that would keep the PI System from fitting well into the Hadoop platform?
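
To make the MapReduce question a bit more concrete, here is a minimal sketch in plain Python of how a per-tag average could be split into a map step (one per archive shard) and a reduce step. It is only an illustration: it does not use the Hadoop or PI System APIs, and the record layout (tag, timestamp, value), the shard contents, and the function names are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical shards: each record is (tag name, timestamp in seconds, value).
# In a real setup each shard would be a whole archive piece on one data node.
shard_a = [("temp", 0, 20.0), ("temp", 60, 22.0), ("flow", 0, 5.0)]
shard_b = [("temp", 120, 24.0), ("flow", 60, 7.0)]


def map_shard(shard):
    """Map step: compute a partial (sum, count) per tag for one shard."""
    partial = defaultdict(lambda: [0.0, 0])
    for tag, _timestamp, value in shard:
        partial[tag][0] += value
        partial[tag][1] += 1
    return {tag: (total, count) for tag, (total, count) in partial.items()}


def reduce_partials(partials):
    """Reduce step: merge the partial sums and counts, then take the mean."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for tag, (total, count) in partial.items():
            merged[tag][0] += total
            merged[tag][1] += count
    return {tag: total / count for tag, (total, count) in merged.items()}


def mean_per_tag(shards):
    """Run the map step on every shard, then reduce the results."""
    return reduce_partials(map_shard(shard) for shard in shards)


if __name__ == "__main__":
    print(mean_per_tag([shard_a, shard_b]))
```

Each shard is processed independently, so the map step could run on whichever node holds that shard, and only the small (sum, count) pairs need to travel to the reducer. Emitting sums and counts rather than per-shard averages is what keeps the final mean correct when shards hold different numbers of events.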