"""# HW5: MapReduce
While mapreduce is an important concept in large scale parallel systems it is possible to simulate it locally and on small practice problems. To this end we provide a simple framework that allows you to run and test map/reduce functions below. This assignment asks you to write the code for two, two-stage mapreduce problems.
"""
from collections import defaultdictimport stringimport numpy as np
def simple_mapreduce(map_fn, reduce_fn, kvin): keyvalues = defaultdict(list) for k, v in kvin: for outk, outv in map_fn(k, v): keyvalues[outk] += [outv] kvout = [] for k, vs in keyvalues.items(): for outk, outv in reduce_fn(k, vs): kvout += [(outk, outv)] return kvout
"""
The input given to you describes a dataset of nodes connected to each other via edges. An example of such a network could be the Twitter follower network, in which the users denote nodes and two nodes are connected by an edge if one follows the other.
* The **degree** of a node in a network is the number of connections it has to other nodes. * The **degree distribution** is the distribution of these degrees over the whole network.
As an example, you can assume that the input you are given is an edgelist in the form of a table as shown below. This table shows pairs of edges connected to each other and thus capture the graph shown on the right:
Source | Destination -------|-------------A | BA | CA | D B | C
The degree distribution then is Degree | Count-------|-------1 | 12 | 23 | 1As you can see, we have one nodes with degree 1 (which is D), two node with degree 2 (which is B and C), and one node with degree 3 (which is A). Thus, the degree distribution captures, for each value of degree, the number of nodes that have that degree. Below you can see the network mapped out"""
%matplotlib inline import matplotlib.pyplot as pltplt.axis("off")plt.scatter( [0, 0, 1, 0], [1, 0, 0, -1], s=1000, marker='o')# Annotate with text + Arrowplt.plot([0, 0], [-1, 1], c="r", zorder=-1)plt.plot([0, 1], [0, 0], c="r", zorder=-1)plt.plot([0, 1], [-1, 0], c="r", zorder=-1)
plt.annotate('A', xy=(0, 0), xytext=(0, 0), c="w")plt.annotate('B', xy=(1, 0), xytext=(1, 0), c="w")plt.annotate('C', xy=(0, -1), xytext=(0, -1), c="w")plt.annotate('D', xy=(0, 1), xytext=(0, 1), c="w")
#=-------------------------
network_dataset = [ ('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'),]
network_expected = [ (3, 1), (2, 2), (1, 1)]
# you are to write the body of the following functions: (replace the `pass`'s with your code
# k=source node, v=destination nodedef q2_map1(k, v): yield k, len(v) def q2_reduce1(k, values): yield k, sum(values) def q2_map2(k, v): pass
def q2_reduce2(k, values): pass
first_stage = simple_mapreduce(q2_map1, q2_reduce1, network_dataset)network_final = simple_mapreduce(q2_map2, q2_reduce2, first_stage)network_final
sorted(network_expected) == sorted(network_final)