Process large-volume data workloads by splitting the data and processing the parts in parallel
Problem
The Internet has enabled applications that generate enormous volumes of data, such as stock trading, web crawling, online product search and data mining. These applications share three characteristics: large data volumes, shared storage, and computing steps that can be executed in parallel.
Solution
Provide map-reduce as a service that assesses processing needs and executes processing steps in parallel or grid mode. The service schedules tasks, splits and distributes data across parallel streams through the grid, and consolidates results as they arrive. It leverages other as-a-service capabilities, such as load balancing, auto-scaling and resource management, to optimize workloads on the grid.
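The split, parallel-map and consolidate steps described above can be illustrated with a minimal single-machine sketch. It is not how a managed map-reduce service or Hadoop is implemented: a local thread pool stands in for grid workers, word counting stands in for the real workload, and the names `split`, `map_chunk` and `map_reduce` are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed


def split(records, num_chunks):
    """Split the input into roughly equal chunks, one per parallel stream."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: count words in one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def map_reduce(records, num_workers=4):
    """Map each chunk in parallel, then merge partial results as they finish."""
    total = Counter()
    # Assumption: a thread pool stands in for the grid's worker nodes.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(map_chunk, c) for c in split(records, num_workers)]
        # Reduce step: consolidate each partial result as soon as it arrives.
        for future in as_completed(futures):
            total.update(future.result())
    return total


if __name__ == "__main__":
    lines = ["buy stock", "sell stock", "buy bond"]
    print(map_reduce(lines, num_workers=2))
```

Because each chunk is mapped independently and the merge is order-insensitive, results can be consolidated in whatever order the workers finish, which is the property that lets a service scale the worker count up or down.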
Application
- Crunching and aggregating large volumes of data
- Improving the performance of applications such as web crawling and product search
- Processing large volumes of transactional data, such as stock trades, in real time with high availability and reliability
- Assessing user behavior and the impact of web ads
Impacts
- Potential costs due to resource allocation across servers, network bandwidth and storage
- Potential limitations due to software licensing
Anti-Patterns
- Running jobs serially during off-peak periods, when resource utilization is low
- Processing across multiple machines with manual steps interleaved between the automated ones
- Installing and administering the map-reduce software (Hadoop or alternatives) yourself, and managing the scale-out of this process across dozens or thousands of virtual machines
Related Patterns
- Facade Pattern
- Pipeline Pattern
- VM Pooling
References
- Google Map-Reduce
- Apache Hadoop



