Fuzzy Matching with Apache Hop (Pentaho)

Written on

January 2, 2025

Here you will learn how to do “fuzzy matching” with the Apache Hop ETL platform. Our challenge will be to take 2 sets of grocery products from different vendors, and to match up pairs of products that have similar names.

First off, we will be working on these example files:

Download example files in .zip 📁

The examples were created in version 2.9.0 of Apache Hop.

Next, you’ll need to download Apache Hop (get the .zip file with the highest numbered version). Hop requires the Java Runtime to run on your computer – I recommend getting the free “Adoptium”. I’m assuming that you already know Apache Hop basics and will focus on Fuzzy Matching in particular.

The musical accompaniment to our adventure is the album “Born in Fire” by the band Sacred Skin.

If you like this you’ll also like Gunship, Carpenter Brut and Vandal Moon.

Fuzzy matching example

We’re starting with 2 CSV files of grocery products. 01-main-stream-TnT.csv contains products from a Canadian grocery store called “T&T”. Here’s a sample:

And the file 02-lookup-against-Galleria.csv has products from another grocer called “Galleria”:

The T&T file is missing a “UPC” (Universal Product Code) which is an important product identifier. Galleria’s products have that code. If only there was some way to get the codes from Galleria’s products and add it to the same product on the T&T file…

Fuzzy matching to the rescue!

Our goal is to add UPC numbers to the T&T products by matching with the same products on the Galleria list. The product names are different between the two vendors: which is why we need to perform an approximate match – a “fuzzy” match.

Because each product also has a “unit size”, I did some pre-processing of the data by lowercasing all text and joining the product name and unit using an @ sign. The composite column we’ll be fuzzy-matching is called “matchagainst“.

That means that a product called “Aroy-D Canned Jackfruit” with a 565 g unit size, is transformed into the string "aroy-d canned jackfruit@565g"

Our overall approach will be to take the T&T products, ingest 1 at a time, and then try to fuzzy match it against the Entire Galleria set of products. Over and over for each individual T&T product:

We will use 2 Apache Hop pipelines for this. One “parent” for ingesting each T&T product and a second “child” for matching that T&T product against the entire Galleria CSV.

The final output will look like this:

(Each row lets us match the “currentProduct” from T&T to the “upc” from Galleria.)

Child pipeline

Let’s look at the fuzzy-matching pipeline that does the main job. It is in the file fuzzymatch-each-product.hpl. The parts of the pipeline are labeled and explained below:

1. Get the “current” T&T product name

This pipeline will process just 1 T&T product name at a time. The Parent Pipeline will pass that product name to us through a variable – and step #1 is a “Get Variables” transform to get that value from the parent.

Note: we don’t need to ingest the whole T&T CSV file just to test this pipeline. We can run this pipeline with 1 default value for our variables. Just click anywhere on the pipeline canvas, click “Edit Pipeline” and you’ll see 2 variables (aka Parameters) that are pre set with default values. Here are the default values I had set in the sample child pipeline:

2. Ingest each Galleria product

Label 2 is a “CSV File Input” transform. It loads the list of Galleria products. We will compare each product with the single T&T product we’re fetching from the variable defined above.

Click “Get Fields” to detect the fields in the CSV file. It’ll try to set the column type to Integer (for SKU) based on the data, but for our case it is OK to just set everything to String for simplicity’s sake. You can also set the length to 20 characters for every field.

What’s going to happen with this pipeline is that it is going to read in the Galleria CSV and run each Galleria product through the entire pipeline. Each one will be fuzzy-matched against the 1 T&T product that remains constant (in a Variable) to see if we have a “match”.

3. The fuzzy match step

Our aim for the Fuzzy Match step is for it to output only the best matching of the Galleria products for our 1 T&T product.

Setting up the Fuzzy Match transform is so tricky that it is the main reason I wrote this post.

Here is how to navigate the Settings:

Lookup transform – this is the source step that contains the long list of potential items that we’re trying to match our 1 value with. In our case, this is the list of Galleria products with UPCs.
Lookup field – the specific field that we’ll fuzzy-match against. This is coming from the step in the “lookup transform”. In our case, it is our Galleria “product name & unit size” composite value.
Main stream field – this is the 1 main product we’re working with. In our case, it is the T&T Product that comes from the variable currentProduct. (Note that we have a special “fetch variable” step that has turned the variable into a “field” that we use here)
Algorithm – the fuzzy-matching algorithm we’ll use to find the best approximate match between our 1 T&T Product (in any given run of the child pipeline) and the many Galleria Products that we’re comparing against. The algorithms are listed and explained in the Apache Hop docs for Fuzzy Match. Your options are:
- Levenshtein
- Damerau Levenshtein
- Needleman Wunsch
- Jaro
- Jaro Winkler
- Pair letters similarity
- Metaphone
- Double Metaphone
- SoundEx
- Refined SoundEx
Get closer value – when this checkbox is TRUE, the fuzzy match will only get the closest matching product name from the available options. If the checkbox is FALSE, it’ll return a list of all products that have a similarity above the threshold you set, concatenated with the “Values separator” you provide.
Minimal value – the minimum similarity between 2 product names that you’ll consider a “match”. Remember that a value of 1 means “The 2 strings are identical” and less-than-one loosens up the conditions for what qualifies as a “match”.

My advice about fuzzy matching algorithms: read how the different algorithms work, and pick those that are applicable for your use case. Then, use data from your actual dataset to try out different algorithms and match thresholds. This will let you “dial in” on the settings that give you an acceptable number of false positives / false negatives.

Fuzzy matching will never be 100% accurate. You’ll probably need to do a manual quality check after your run.

Here is the second settings window and it has to do with the output values:

Match field – this is the name for the output field that will contain the actual value that was a match. In our case, it’ll be the “matchagainst” value from the Galleria values.
Value field – this field will contain the numerical “match quality” for the algorithm you pick. (see below)
Get fields – if you want to get additional fields from the “Lookup transform” then you need to click “Get fields” and include them in the output from this transform. In our case, these are the columns from the Galleria data – we’re looking for that “upc” column so that is the most critical field to have in that list.

Here’s a real example of the output from a fuzzy match step:

“currentProduct” is the T&T Product name we fed into the step. It was the same value for each of the Galleria products we compared against.
“found” is the field that actually matched on the Galleria side. Note that it is a different value from “currentProduct” but it is similar enough to match using the Jaro algorithm. The fuzzy match was successful!
“match strength” indicates how strong the match is. Here, it is an 0.885 quality match – above the 0.78 threshold in the screenshot but lower than a “1” identical value match.
All the fields here were pulled from the same row in the Galleria CSV as the product that fuzzy-matched.

For additional help on the Fuzzy Match transform you can read the Hop documentation for it.

4. Fetch the T&T SKU value into our output stream

This step is another “Get Variable” transform step that grabs the T&T internal product ID (the “SKU”) from the parent pipeline, and puts it into the output stream after our Fuzzy Match.

5. “Copy rows to result”

The last step is “Copy rows to result”. It’ll pass all our “child’s” data to the parent pipeline: the T&T Product identifier “currentProduct”, the T&T SKU (a T&T-specific product identifier) and all the data for the matching Galleria product.

Because our Fuzzy Match transform is only fetching the 1 closest match for a given T&T product, we are also passing just 1 row of results to the parent pipeline.

At this point you can click the ▶️ icon on the child pipeline to run the fuzzy match against our 1 default product that’s set as a variable.

Parent pipeline

Let’s look at the “parent pipeline” that iterates through every T&T product and feeds them into the fuzzy match “child pipeline”. The Parent pipeline is in the file parent-pipeline.hpl and it looks like this:

What it does is read the T&T products from a CSV file; sends each product to the child fuzzymatch-each-product.hpl pipeline; and writes the output of all the matches into another CSV file.

The trickiest parts are in the middle Pipeline executor transform. In the settings, I am setting up 2 variables – “varCurrentProduct” and “varCurrentSKU” that will contain the T&T product name (from the .csv input field “matchagainst”) and the T&T SKU. These variables will be available to the Child pipeline through Variables:

The Child Pipeline’s values for these variables will change on every execution, with every row of T&T data.

Additionally, when you SHIFT-drag from this “pipeline execute” step to connect it to the CSV file writer, you’ll see 5 different options for the kind of data you want in that stream. Choose “This output will contain the result rows after execution“

The different types of output will result in the below values outputting to disk:

Final result

You perform the entire fuzzy match for all products by clicking the ▶️ icon on the Parent Pipeline parent-pipeline.hpl

I ran the Jaro fuzzy-match with a match threshold of 0.78 and here are the results I got. Successful matches are marked in green, false-positives in red:

You can see that, for example, “red curry” on the T&T product and “coconut milk” on the Galleria side are considered a successful match. This is too loose – I don’t want matches like these. So we need to raise the number for our match threshold.

Most of the accurate matches in green have a “match strength” value above 0.86 so that’s going to be the new “minimal value” for the Fuzzy Match step. The re-run results look like this:

Much better!

But there are still a couple of products that aren’t exactly the same. We can catch this with a manual review…

Tips

Apache Hop has a very quirky user interface and an odd mental model. Here are a few tips as you adapt my sample files to your use case:

Error reporting

Use the “Metrics” and “Logging” tab at the bottom of the canvas to see what went wrong with your pipeline run. Error messages will appear in “Logging”.

Detail-level for error messages

If you need a greater level of detail in your error messages, when you do a “full run” you’ll be prompted to set the granularity of the log messages:

Data snapshots at each stage of the pipeline

In every successful run, you’ll see these “data table” icons appear next to many Transforms on your canvas. Left click on that tiny icon to see an immediate preview of the actual data at that step. (If your step doesn’t show this little icon, you can add a “Dummy”-type transform that’ll allow you to see data at that stage)

If your pipeline breaks after changing a CSV file

If your working pipeline stops working after you make a change to a CSV file, then you often need to refresh the list of fields in Hop so it reflects the latest fields in the CSV.

Go to your CSV file input transform and click “Get Fields” to refresh the list. Remember to also go to your Fuzzy Matching transform, go to the second tab that says “Fields” and click “Get Fields” there too!

Dear reader: if you are comfortable with Apache Hop and have spotted a mistake – please get in touch with me at “jacob” at this website – or write your own post and I’ll gladly link to you.

A word about “Fuzzy Grouping”

Fuzzy Grouping is where you have a set of values, and you find ones that are similar to each other through a fuzzy comparison. Fuzzy Matching compares 1 value to 1 other value. Fuzzy Grouping discovers groupings of similar values without your knowing the proper groupings ahead of time.

In our grocery example, this would be like getting product names from 7 vendors and making sure that the same products (with a slightly different name) are properly grouped together.

There are very few tools for fuzzy grouping. I know that SSIS has this, and I’m pretty sure that WinPure also does fuzzy grouping. The free OpenRefine’s clustering tools might be fit for the job.

The basic way to fuzzy group in Apache Hop is to combine all 7 vendors’ products into 1 CSV, run it through the parent and run it through the child – so you get every product trying to fuzzy match with every other product across all vendors.

There are certain filtering steps that you can perform to reduce the repetitive work (ex. don’t compare a T&T product against other T&T products).

Here is my example of a fuzzy grouping setup:

Each vendor’s data is passed through a separate fuzzy match step that operates on the 1 “input product” from the “get variables” transform.

As data flows into the pipeline from the CSV, it is split into separate vendor paths (“Vendor switch” on the left) through a “Switch / case” transform with the following settings:

And the final output looks like this:

You can see how different vendor’s products (“match” column) are matched up to the 1 product fed into the process (“myprod”). Vendors without a matching product name result in a <null> row.

Get monthly updates. Who’s dating whom on the blog? Who’s my baby’s daddy? Pinocchio vs. Pinochet who wins?