If you’re a frequent user of e-commerce websites, you’re probably familiar with the lists of related products these sites often feature to help customers find what they’re looking for on the store. Amazon, in particular, sports several such lists on every page, including “Frequently Bought Together” and “Customers Who Bought This Item Also Bought.”
If you’ve been to our site lately, or read our announcement, you know that Inventables is now an e-commerce site. As such, we thought it would be a good time to explore adding some of these types of lists to our product pages.
In this post I’ll talk about how we scripted a process that uses Hadoop MapReduce to parse the millions of product views we’ve accumulated over the past year or so, and put it into a format that can be used to display the list “People who viewed this product also viewed” for every product on our site. To see examples of this list in the wild, check out one of our cool products.
Formal Definition of our List
Given two products p1 and p2, we will define p1 related-by-path-to p2 (and vice versa) if there exists a user-session-specific path P such that page-view(p1) and page-view(p2) are both members of P. Additionally, the weight of this relationship between p1 and p2 will be equal to the number of distinct paths of which they are both members. If 10 users viewed both p1 and p2 during a single visit, then those two products will have a path-relation of weight 10.
To display the list for a given product, we will sort the set of related products in descending order of weight. For example, let’s say you’re looking at Rubber Glass. If you see Flexible Metal Brick as the first item in the People who viewed Rubber Glass also viewed list, then that means more people who looked at Rubber Glass looked at Flexible Metal Brick than any other product on the site.
The Raw Data
To build our list we’ll be drawing on a single table from our application database. This table stores an entry for every unique product viewed by a user during a given session. Only two columns from this table are needed for our task: path_id and product_id. We can export this data from our Amazon RDS MySQL instance using the mysql command line utility like so:
mysql -u $user --password=$pass --host=$host --batch -e \
  "SELECT path_id, product_id FROM recent_items" \
  $database > /tmp/input/recent_items.tsv
This tab-separated file will constitute the initial input to our MapReduce workflow.
I selected Hadoop Streaming to drive the MapReduce workflow because of the large volume of data we’re dealing with, and also because we run our application infrastructure on the Amazon AWS platform, which, in the form of their Amazon Elastic MapReduce offering, has built-in support for Hadoop Streaming. For now we can get by with running Hadoop in single node mode, but if our scale grows or we start doing more things with MapReduce, we’ll be able to easily switch to true cluster mode.
The goal of this exercise was to start with a table of path_id, product_id values and ultimately generate a new table of source_product_id, target_product_id, weight values, where the source and target product are related to each other with the given weight. From this final data format, it is trivial to build a gallery to display the top n related products for a given source product.
To get from the millions of path_id, product_id records to the desired source_product_id, target_product_id, weight format using MapReduce, we take a two-phase approach.
MapReduce Phase 1
This phase creates an output file where each row represents one instance of a product-to-product relationship from a single path. As such, each row will carry a weight of 1.
The input to this function is already in the desired format of path_id, product_id, so this mapping step is just the identity function.
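For concreteness, here is a sketch of what an identity mapper looks like as a Hadoop Streaming script (any pass-through, such as /bin/cat, would work equally well; the function name below is our own):

```ruby
#!/usr/bin/env ruby
# Identity mapper (sketch): Hadoop Streaming delivers input records on
# stdin, one per line. Each "path_id\tproduct_id" line is emitted
# unchanged so the framework can sort and group them by path_id.
def identity_map(input, output)
  input.each_line { |line| output.puts line.chomp }
end

identity_map(STDIN, STDOUT) if __FILE__ == $PROGRAM_NAME
```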
Hadoop MapReduce guarantees that all rows for a given key emitted by the map step will go to the same reducer. In our case, the key is a path id. So in this first reducer, we iterate over all the rows for a given path, appending each product id to an array. Once we have all the product ids for a given path, we can emit all of the permutations of product_id, product_id pairs.
For example, let’s say the input included the following rows:
path1, product1
path1, product2
path1, product3
(The user looked at 3 products during his session.)
The reducer would first map path1 to the array [product1, product2, product3]. Then, it would emit six key/value pairs of the form:

product1_product2, 1
product1_product3, 1
product2_product1, 1
product2_product3, 1
product3_product1, 1
product3_product2, 1
And the Ruby source:
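A minimal sketch of such a reducer in Ruby, assuming tab-separated path_id/product_id lines arrive on stdin already grouped by path_id (which Hadoop's sort guarantees), might look like this:

```ruby
#!/usr/bin/env ruby
# Phase 1 reducer (sketch): collect all product ids seen on one path,
# then emit every ordered pair of distinct products with a weight of 1.
# Input lines: "path_id\tproduct_id", grouped by path_id.

def emit_pairs(products, out)
  # permutation(2) yields every ordered pair, e.g. 3 products -> 6 pairs
  products.permutation(2) { |a, b| out.puts "#{a}_#{b}\t1" }
end

def reduce_paths(input, out)
  current_path = nil
  products = []
  input.each_line do |line|
    path, product = line.chomp.split("\t")
    if path != current_path
      emit_pairs(products, out) unless current_path.nil?
      current_path = path
      products = []
    end
    products << product
  end
  emit_pairs(products, out) unless current_path.nil?
end

reduce_paths(STDIN, STDOUT) if __FILE__ == $PROGRAM_NAME
```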
MapReduce Phase 2
The second and final MapReduce will sum the weights associated with every unique product-product key.
Since the input to the second map phase is already in the desired format of product1_product2, weight, the identity function can also be used here.
The reduce step, being given all of the weights for a given product-product key at a time, simply sums them and emits one key/value pair with the total weight.
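Sketched in the same spirit as the phase 1 reducer (again assuming tab-separated key/weight lines on stdin, grouped by key), a summing reducer might look like:

```ruby
#!/usr/bin/env ruby
# Phase 2 reducer (sketch): sum the weights for each product-pair key.
# Input lines: "product1_product2\tweight", grouped by key.

def reduce_sums(input, out)
  current_key = nil
  total = 0
  input.each_line do |line|
    key, weight = line.chomp.split("\t")
    if key != current_key
      out.puts "#{current_key}\t#{total}" unless current_key.nil?
      current_key = key
      total = 0
    end
    total += weight.to_i
  end
  out.puts "#{current_key}\t#{total}" unless current_key.nil?
end

reduce_sums(STDIN, STDOUT) if __FILE__ == $PROGRAM_NAME
```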
Pulling in the data
After the second MapReduce phase completes, we have the data we need. All that’s left is to import it into our database, which we can do with the mysqlimport command line utility:
mysqlimport --local --compress -u $user --host=$host \
  --columns=source_product_id,target_product_id,count \
  --replace $database \
  /tmp/final_output/viewed_products.tsv
(mysqlimport derives the target table name, viewed_products, from the name of the input file.) The viewed_products table has a unique index on source_product_id and target_product_id, so the --replace option overwrites stale weights on each run instead of creating duplicate rows.
Stitching it all up
We want our process to be automated so that it can be run on a schedule. This can be accomplished with a simple shell script.
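A sketch of what such a script might look like follows. Everything here is illustrative rather than our production script: the reducer script names (pair_reducer.rb, sum_reducer.rb), the streaming jar path, and the environment variables are all assumptions. The DRY_RUN guard just prints each command so the pipeline can be previewed; set DRY_RUN=0 to actually execute it.

```shell
#!/bin/bash
# Nightly related-products pipeline (sketch; names and paths are illustrative).
set -euo pipefail

DRY_RUN=${DRY_RUN:-1}   # default to previewing; DRY_RUN=0 runs for real

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $1"
  else
    bash -c "$1"
  fi
}

pipeline() {
  # 1. Export path_id/product_id pairs from MySQL as a TSV file.
  run 'mysql -u "$DB_USER" --password="$DB_PASS" --host="$DB_HOST" --batch -e "SELECT path_id, product_id FROM recent_items" "$DB_NAME" > /tmp/input/recent_items.tsv'

  # 2. Phase 1: identity mapper, pair-emitting reducer.
  run 'hadoop jar "$STREAMING_JAR" -input /tmp/input -output /tmp/phase1 -mapper /bin/cat -reducer pair_reducer.rb -file pair_reducer.rb'

  # 3. Phase 2: identity mapper, weight-summing reducer.
  run 'hadoop jar "$STREAMING_JAR" -input /tmp/phase1 -output /tmp/final_output -mapper /bin/cat -reducer sum_reducer.rb -file sum_reducer.rb'

  # 4. Load the summed weights back into MySQL.
  run 'mysqlimport --local --compress -u "$DB_USER" --host="$DB_HOST" --columns=source_product_id,target_product_id,count --replace "$DB_NAME" /tmp/final_output/viewed_products.tsv'
}

pipeline
```

From here the script can be dropped into cron (or a job scheduler) to refresh the lists on whatever cadence we like.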
As you can see, it wasn’t hard at all to build our first list of related products using Hadoop MapReduce. We plan to do a lot more in this realm going forward.