If you’re a frequent user of e-commerce websites, you’re probably familiar with the lists of related products these sites often feature to help customers find what they’re looking for on the store. Amazon, in particular, sports several such lists on every page, including “Frequently Bought Together” and “Customers Who Bought This Item Also Bought.”
If you’ve been to our site lately, or read our announcement, you know that Inventables is now an e-commerce site. As such, we thought it would be a good time to explore adding some of these types of lists to our product pages.
In this post I’ll talk about how we scripted a process that uses Hadoop MapReduce to parse the millions of product views we’ve accumulated over the past year or so, and put it into a format that can be used to display the list “People who viewed this product also viewed” for every product on our site. To see examples of this list in the wild, check out one of our cool products.
Formal Definition of our List
Given two products p1
and p2
, we will define p1 related-by-path-to p2
(and vice versa) if there is exists a user-session-specific path P where page-view(p1)
and page-view(p2)
are both members of P. Additionally, the weight of this relationship between p1
and p2
will be equal to the number of distinct paths for which they are both members. If 10 users viewed both p1
and p2
during a single visit, then those two products will have a path-relation of weight 10.
To display the list for a given product, we will sort the set of related products in descending order of weight. For example, let’s say you’re looking at Rubber Glass. If you see Flexible Metal Brick as the first item in the People who viewed Rubber Glass also viewed list, then that means more people who looked at Rubber Glass looked at Flexible Metal Brick than any other product on the site.
The Raw Data
To build our list we’ll be drawing on a single table from our application database. This table stores an entry for every unique product viewed by a user during a given session. Only two columns from this table are needed for our task: path_id
and product_id
. We can export this data from our Amazon RDS MySQL instance, using the mysql command line utility like so:
mysql -u $user --password=$pass --host=$host --batch -e \
"SELECT path_id, product_id FROM recent_items" \
$database > /tmp/input/recent_items.tsv
This tab-separated file will constitute the initial input to our MapReduce workflow.
Tool Selection
I selected Hadoop Streaming to drive the MapReduce workflow because of the large volume of data we’re dealing with, and also because we run our application infrastructure on the Amazon AWS platform, which, in the form of their Amazon Elastic MapReduce offering, has built-in support for Hadoop Streaming. For now we can get by with running Hadoop in single node mode, but if our scale grows or we start doing more things with MapReduce, we’ll be able to easily switch to true cluster mode.
Solution Outline
The goal of this exercise was to start with a table of path_id, product_id
values and ultimately generate a new table of source_product_id, target_product_id, weight
values, where the source and target product are related to each other with the given weight. From this final data format, it is trivial to build a gallery to display the top n related products for a given source product.
To get from the millions of path_id, product_id
records to the desired source_product_id, target_product_id, weight
format using MapReduce, we take a two-phase approach.
MapReduce Phase 1
This phase creates an output file where each row represents one instance of a product-to-product relationship from a single path. As such, each row will carry a weight of 1.
Map:
The input to this function is already in the desired format of path_id, product_id
, so this mapping step is just the identity function.
Reduce:
Hadoop MapReduce guarantees that all rows for a given key emitted by the map step will go to the same reducer. In our case, the key is a path id. So in this first reducer, we iterate over all the rows for a given path, appending each product id to an array. Once we have all the product id’s for a given path, we can emit all of the permutations of product_id, product_id
pairs.
For example, let’s say the input included the following rows:
path1, product1
path1, product2
path1, product3
(The user looked at 3 products during his session.)
The reducer would first map path1
to the array [product1, product2, product3]
. Then, it would emit six key/value pairs of the form source_product_id_target_product_id, weight
:
product1_product2, 1
product1_product3, 1
product2_product1, 1
product2_product3, 1
product3_product1, 1
product3_product2: 1
And the ruby source:
MapReduce Phase 2
The second and final MapReduce will sum the weights associated with every unique product1_product2
key.
Map:
Since the input to the second map phase is already in the desired format of product1_product2, weight
, the identity function can also be used here.
Reduce:
The reduce step, being given all the weights for a given product-product combination at a time, simply iterates over each key, sums the weight, and emits one key/value pair with the total.
Ruby source:
Pulling in the data
After the second MapReduce phase completes, we have the data we need. All that’s needed is to import it into our database. We can accomplish this again with the mysql command line utility:
mysqlimport --local --compress -u $user --host=$host \
--columns=source_product_id, target_product_id,count \
--replace $database \
/tmp/final_output/viewed_products.tsv
Note: the viewed_products
table has a unique index on source_product_id, target_product_id
.
Stitching it all up
We want our process to be automated so that it can be run on a schedule. This can be accomplished with a simple shell script.
Conclusion
As you can see, it wasn’t hard at all to build our first list of related products using Hadoop MapReduce. We plan to do a lot more in this realm going forward.
–Jeff