Map Reduce

The MapReduce pattern is a powerful way to process large datasets by breaking down a complex task into two main phases: Map (processing individual items) and Reduce (aggregating results). Caskada provides the perfect abstractions to implement this pattern efficiently, especially with its ParallelFlow for concurrent mapping.

Core Components

  1. Trigger Node:

    • Role: Triggers the fan-out to Mapper Nodes from a list of items.

    • Implementation: A standard Node whose post method iterates through the input collection and calls this.trigger() for each item, passing item-specific data as forkingData.

  2. Mapper Node:

    • Role: Takes a single item from the input collection and transforms it.

    • Implementation: A standard Node that receives an individual item (often via forkingData in its local memory) and performs a specific computation (e.g., summarizing a document, extracting key-value pairs).

    • Output: Stores the processed result for that item in the global Memory object, typically using a unique key (e.g., memory.results[item_id] = processed_data).

  3. Reducer Node:

    • Role: Collects and aggregates the results from all Mapper Nodes.

    • Implementation: A standard Node that reads all individual results from the global Memory and combines them into a final output. This node is usually triggered after all Mapper Nodes have completed.

MapReduce Flow

Pattern Implementation

We use a ParallelFlow for the mapping phase for performance gain.

Example: Document Summarization

Let's create a flow to summarize multiple text files using the pattern we just created.

For simplicity, these will be overly-simplified mock tools/nodes. For a more in-depth implementation, check the implementations in our cookbook for Resume Qualification (Python) - more TypeScript examples coming soon (PRs welcome!).

1. Define Mapper Node (Summarizer)

This node will summarize a single file.

2. Define Reducer Node (Aggregator)

This node will combine all individual summaries.

3. Assemble the Flow

This example demonstrates how to implement a MapReduce pattern using Caskada, leveraging ParallelFlow for concurrent processing of the map phase and the Memory object for collecting results before the reduce phase.

Last updated