When we learned about women printers from Victorian times, I mentioned that Austrian compositor Wally Prohaska worked with a business partner – Anton Halauska.
Anton’s work was quite prominent in the pages of the Printers’ International Specimen Exchange. It was punchy. Just look at this bangin’ self-portrait:
In addition to being a printer, Anton’s father ran a bookstore in Olmütz. In 1861 the business failed and he went bankrupt. That’s a risk every businessman takes on. But what’s unusual is what his father did next: he lied to his creditors, pretending he had received a fresh cash investment to keep the business going.
English translation (from ChatGPT):
A Warning.
A domestic colleague received a summons from a notary in Olmütz dated May 17 of this year, instructing him to appear on June 3 as a creditor of the printing and book business of Anton Halauska in Olmütz, or to secure his claims through a representative.
On the same day that this summons was issued, a printed circular from A. Halauska, dated May 18, arrived. In it, he declared his suspension of payments while at the same time announcing that business would continue with new strength and under more favorable conditions. This was supposedly possible because “Mr. Fleischmann in Olmütz, who is known as a capable businessman and has significant capital at his disposal, had agreed to become his business partner.”
Since further details about the aforementioned Mr. Fleischmann were unknown, it was stated that Mr. Fleischmann was willing to provide further information if requested via Mr. Braumüller in Vienna.
Believing that an amicable settlement was preferable to legal proceedings, the recipient of the circular from A. Halauska was inclined to trust its claims. However, as a precaution, he contacted Mr. Braumüller with the request for confirmation regarding Mr. Fleischmann, based on the statements in the circular.
Mr. Braumüller was courteous enough to reply, stating that he neither personally knew the aforementioned Mr. Fleischmann nor was he in a position to provide any details about his financial situation. Furthermore, he had already distanced himself from the circular and requested a public retraction of the statement that referenced him.
In the meantime, this correspondence resulted in the registration deadline being postponed to June 3.
The simple presentation of these facts clearly shows that A. Halauska’s circular had no other purpose than to deceive creditors, keeping them calm while preventing them from asserting their claims at the right time. Otherwise, what purpose would such a registration deadline or the recommendation from a highly esteemed colleague serve?
Such self-serving conduct does not require further commentary. It is not merely a duty but a necessity to warn against it!
Anton Jr. served one year in the army and went on to get an eclectic education, which included becoming a master stenographer (and publishing a book on the subject). Afterwards, Anton wished to found his own print shop in Salzburg but was denied permission. He went on to establish one in the nearby Austrian town of Hallein with Wally Prohaska – with doors opening on December 15, 1882.
Anton was an “Artistic Printer“, which means that he worked at the cutting edge of print design. Here is an example of Halauska’s printing:
In 1883, a year after establishing his business, Anton’s father passed away at the age of 70. Anton himself would not live to such an old age.
In Hallein, Anton invented the textured printing effect of “Selenotype”. For that and other contributions to print, he received permission to use the imperial eagle in his coat of arms and seal.
A sample of Selenotype (source) · Halauska’s use of the Imperial Eagle · An ad for Selenotype, 1886 (source)
In 1888, The British Printer ran a profile of Halauska, shown below. By this point, the talented Austrian printer’s fame had reached England.
In 1888, Anton and Wally were awarded a Bronze Medal and an Honorary Diploma at the German National Arts and Crafts Exhibition in Munich (as reported in the “Buchdrucker-Zeitung” and listed in the official record).
Later, in 1893, Halauska travelled to the World’s Columbian Exposition (“The World Fair”) in Chicago to represent his country.
According to a Jan. 27, 1906 issue of the Hallein Volksfreund, Halauska’s printing house carried on business in Hallein up until 1895. The business apparently moved to Hallein from Zell am See, and Anton claimed to operate in both locations. In 1896, the press finally moved to Salzburg – the original town where Anton wanted to base his business. Notably, Anton published the calendar “Der Bote aus dem Salzachthale” and “Technisches Jahrbuch für den Buch- und Kunstdruck” – a “technical yearbook of book and art printing” with examples produced mostly by Anton himself.
A short 3 years after getting married and moving to Salzburg, Halauska died from an illness. He passed away on the 8th of November, 1899 at his home at 9 Giselakai in Salzburg (source). He was 47 years old.
Anton Halauska’s last home, number 9.
In 1900 you start seeing references to “Buchdruckerei von A. Halauska’s Witwe”, which is the “Printing house of A. Halauska’s Widow”. Augusta may have restarted the business for a while with a partner named “Eiblhuber”, or may have simply used the Halauska name to give endorsements to print equipment manufacturers.
There’s also an academic article about Anton Sr. in Czech: Stanislava Kovářová, “Olomoucký tiskař Anton Halauska, aneb, ze Seničky do světa”, Střední Morava, ISSN 1211-7889, vol. 14, no. 26 (2008), pp. 123–128.
In November 1983 a set of remarkable machines arrived at the Ontario Ministry of Education offices. These were the prototypes for Ontario’s very own computer: the ICON.
The ICON came to have lots of names: the Cemcorp ICON, Unisys ICON, Burroughs ICON and… “The Bionic Beaver”. It represents a time when Ontario was on the cutting edge. In the early 1980s, personal computers were still a new concept and there was debate about how to use PCs in the educational system – if at all. Despite the debate, the Ministry of Education went ahead and commissioned a computer to meet students’ needs.
A recent tumblr post about the ICON made the rounds online. In that post, the author relied on their memory to hand-draw “screenshots” from ICON programs, because no real screenshots exist online:
Greeting screen of the ICON (source). I believe it would also say “hello” through the built-in speech synthesizer.
The ICON’s beginnings
The ICON story started in 1982, when the Ontario Ministry of Education laid out the vision for computer use in the classroom. Computers were to be a tool for students to extend original thought: to write, compose, design and analyze. Not just as a terminal for accessing raw information. This feels like a very intelligent approach to computers.
They created the GEMS subsidy (Grant Eligible Microcomputer System) with special requirements for hardware, Canadian content and an approach to computing that supported the Ministry’s approach to education. Only the Cemcorp consortium’s ICON computer was eligible at first. School boards that bought a GEMS-qualifying computer system would have the Province reimburse 75% of the cost.
This book review from 1986 shows that not everyone was on board with PCs in classrooms. (Source: Orbit 77, 1986)
Features and software
The ICONs lacked a hard drive or floppy disk – they would only work if they were connected to each other and to a “server” computer called the LEXICON (the plain box with a screen at the far left):
“Left: Lexicon server running ICON System 3.00.04 (1988) based on QNX 2.05b. Center: CEMCORP ICON 1. Right: Unisys ICON 2” (source)
Each time students turned on the ICON, it would download its operating system from the LEXICON server anew. At the end of a session, students could save their files on the LEXICON’s hard disk or floppy drive.
The LEXICON had a speech synthesizer and you could use the “say” command to vocalize whatever you typed.
Its standout feature was a near-indestructible trackball that was built into the keyboard.
“Left: Unisys ICON 2 at login prompt. Right: Unisys PW2 Advantage 4336DX server running ICON System 5.00.02 (1993) based on QNX 3.15g” (source)
Crosscountry Canada (Crosscountry on Wikipedia) (Possibly) Ernie’s Big Splash Ambience Map Manoeuvre Mathrace (re-release) Mathville (2 disks) Measuring II (2 disks) Melody Manipulations (re-release) Menulay II (2 disks) Micro News (re-release) Mind Your Own Business Musica Musicland (re-release) Music Toolkit New Frontiers (2 disks) New Kid In Town The Number Place (re-release) Ambience Offshore Fishing On My Way (5 disks) Putting Yourself Together Puddles to Pondwater QSPREAD Ambience Queues Quiz ‘N Art The Rebels Refugees in the Wilderness (related , related – and pdf local copy) Robot R & D Subdicion Planner Time Manager Tour of the Universe Ambience Treasure of Ile Madame Ambience Two-file Merge Ambience United We Stand Ambience Upstairs-Downstairs (a maze game) OESS The Voyages of Columbus (2 disks) Watfile/Plus What is Weather (2 disks) Wpro Yes and No Ambience The Academy (2 disks) Adventure Ontario (4 disks) Animals/Garbage Watfile Databases OESS Art Treasures; Unusual Countries
Ambience Array Game Astronomy A Week in the Life of … (2 disks) Ambience B.C. Lumbering Build-a-Bird Build a Land Bird Build a Shore Bird Canadian Shield Railway Ambience Cargo Sailer OESS Cattle; Contributing Canadians Choices Jr. Cloze Encounters Unlimited (4 disks) Computer Type Ambience Data Classification Dynamap (2 disks) Eco-Island English 1 (5 disks) Explorer Finding Our Way Foodfare (2 disks) From The Apple to The Moon Geometry Mart OESS Get Ready For Math Greenhouse (re-release) Greenhouse 3 / LA Serre 2 The Golden West (2 disks) Imagine (re-release) Ambience Infoschool Interail (5 disks) Ipaint II+ Italk IZZIT Storymate Resort Development Simulator Mathville (re-release ) ( 2 disks) Money Market Mindflight Bundle Admin and Utilities Mindflight Bundle (6 disks) Mindflight Bundle Tools for Schools
OESS Moving Words Cadtutor Update Alice: The Personal Pascal Electric Chemistry Building, Phase II (3 disks) OESS Casi Accounting Function Worshop (2 disks) Electric Chemestry Building (2 disks) QNX CADTutor Electric Chemistry Building, Phase III (4 disks) Business Development Simulation (2 disks) Unisys Icon System Software Release 2.25 Rev 3 Jobs for You OESS Know Your Numbers OESS Know to Add OESS Learn to Count OESS Learn to Subtract OESS Learning Game Generator Life in New France Longhouse (2 disks) Conduct – Camp System Diskette Conduct – QNX Version Conduct – Ambience Version Choices Jr (DOS Version) (2 disks) Computer Architecture (re-release) (2 disks) Chemistryland (re-release) (2 disks) Decide Your Excellency Intuitrig (re-release) Ipaint II Let There be Light (2 disks) Keep It Running – Rally Keep It Running – Garage Lemming Count The Sheridan Prewriter Ambience Sequential Search (Possibly)Northwest Fur Trader or Voyageur
“The Ambience” was a user-interface designed to improve the use of lessonware on the ICON computer. That’s why so many of these programs have the word “Ambience” in their name.
You can get an idea of what the PC revolution felt like for a class of Grade 1 students from this wonderful ethnographic study of the ICON from 1989. The report goes into the details of some of the educational programs on the ICON and dives into children’s “fuzzy exploration” of computers – which didn’t always square with a Government Minister’s ideas of computer use.
The ICON was an expensive project and, because of the strict educational requirements for software, it wasn’t appealing for commercial software development shops – every piece of software had to be commissioned by the government.
By 1987, some of the shine was off the ICON:
…the Icon, which was supposed to be a new specialty in the economy of Ontario when it was introduced, is now produced in Taiwan. One wonders about some of the rest of the program.
For example, I had discussions with a hands-on, very informed principal in my own school system in Hamilton, who had developed for his own school, prior to and during the Icon program, a very impressive delivery program which cost about $6,000. When the Icons were awarded to him, the cost was $35,000 for basically the same operation. One wonders how much is needlessly being spent across the province on the computers in education program.
If I read this proposal by the minister correctly, it is to open up the accessibility of the computer program to all sorts of hardware and software deliverers. In that sense, it will undoubtedly economize the system, and I am grateful for that. But I remind the minister that persons as eminent in science and technology as David Suzuki have recently written very sceptical things about the presence and place of computers in education.
Towards the early 1990s, there was a lot of software that was only suitable for the ICON and hadn’t been adapted to the popular IBM-compatible PCs flooding the market. The ICONs were underpowered compared to mainstream computers. Gradually, the rules for GEMS were loosened so more vendors could qualify. Schools were buying computers outside the GEMS program (which meant no subsidy). A cross-compatibility platform for ICON software called EASI (Educational Application Software Interface) was started but seems never to have materialized.
In the end, PCs found a permanent home in Ontario’s schools and high schools. These weren’t the educational ICONs, but rather the general-purpose Mac and Windows machines that ended up in Canadians’ family homes. The story of a made-in-Ontario computer came to a close.
In a way, the story of the Icon is a recurring theme of Canadian anti-competitive and anti-market behaviour. In order to accomplish an audacious goal, we create a consortium of organizations (CEMCorp) and essentially give it a monopoly over a product. In this instance, it is the Liberals calling it out in 1983:
Here you will learn how to do “fuzzy matching” with the Apache Hop ETL platform. Our challenge will be to take 2 sets of grocery products from different vendors, and to match up pairs of products that have similar names.
First off, we will be working on these example files:
The examples were created in version 2.9.0 of Apache Hop.
Next, you’ll need to download Apache Hop (get the .zip file with the highest-numbered version). Hop requires a Java Runtime to run on your computer – I recommend the free Adoptium builds. I’m assuming that you already know Apache Hop basics, so I’ll focus on fuzzy matching in particular.
The musical accompaniment to our adventure is the album “Born in Fire” by the band Sacred Skin.
If you like this you’ll also like Gunship, Carpenter Brut and Vandal Moon.
Fuzzy matching example
We’re starting with 2 CSV files of grocery products. 01-main-stream-TnT.csv contains products from a Canadian grocery store called “T&T”. Here’s a sample:
And the file 02-lookup-against-Galleria.csv has products from another grocer called “Galleria”:
The T&T file is missing a “UPC” (Universal Product Code) which is an important product identifier. Galleria’s products have that code. If only there was some way to get the codes from Galleria’s products and add it to the same product on the T&T file…
Fuzzy matching to the rescue!
Our goal is to add UPC numbers to the T&T products by matching them with the same products on the Galleria list. The product names differ between the two vendors, which is why we need to perform an approximate match – a “fuzzy” match.
Because each product also has a “unit size”, I did some pre-processing of the data by lowercasing all text and joining the product name and unit using an @ sign. The composite column we’ll be fuzzy-matching is called “matchagainst”.
That means that a product called “Aroy-D Canned Jackfruit” with a 565 g unit size, is transformed into the string "aroy-d canned jackfruit@565g"
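If you’d rather do this pre-processing outside of Hop, here’s a rough Python sketch of the same idea (the function name and arguments are my own placeholders, not anything from the sample files):

```python
# A rough sketch of the "matchagainst" pre-processing described above.
def make_matchagainst(product_name: str, unit: str) -> str:
    # Lowercase both parts, strip spaces from the unit, join with "@".
    return f"{product_name.strip().lower()}@{unit.strip().lower().replace(' ', '')}"

# "Aroy-D Canned Jackfruit" + "565 g"  ->  "aroy-d canned jackfruit@565g"
print(make_matchagainst("Aroy-D Canned Jackfruit", "565 g"))
```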
Our overall approach will be to take the T&T products, ingest them one at a time, and try to fuzzy-match each one against the entire Galleria set of products – over and over, for each individual T&T product:
We will use 2 Apache Hop pipelines for this. One “parent” for ingesting each T&T product and a second “child” for matching that T&T product against the entire Galleria CSV.
The final output will look like this:
(Each row lets us match the “currentProduct” from T&T to the “upc” from Galleria.)
Child pipeline
Let’s look at the fuzzy-matching pipeline that does the main job. It is in the file fuzzymatch-each-product.hpl. The parts of the pipeline are labeled and explained below:
1. Get the “current” T&T product name
This pipeline will process just 1 T&T product name at a time. The Parent Pipeline will pass that product name to us through a variable – and step #1 is a “Get Variables” transform to get that value from the parent.
Note: we don’t need to ingest the whole T&T CSV file just to test this pipeline. We can run this pipeline with a default value for our variables. Just click anywhere on the pipeline canvas, click “Edit Pipeline” and you’ll see 2 variables (aka Parameters) that are preset with default values. Here are the default values I set in the sample child pipeline:
2. Ingest each Galleria product
Label 2 is a “CSV File Input” transform. It loads the list of Galleria products. We will compare each product with the single T&T product we’re fetching from the variable defined above.
Click “Get Fields” to detect the fields in the CSV file. It’ll try to set the column type to Integer (for SKU) based on the data, but for our case it is OK to just set everything to String for simplicity’s sake. You can also set the length to 20 characters for every field.
What’s going to happen is that this pipeline reads in the Galleria CSV and runs each Galleria product through the rest of the pipeline. Each one is fuzzy-matched against the 1 T&T product that remains constant (in a variable) to see if we have a “match”.
3. The fuzzy match step
Our aim for the Fuzzy Match step is for it to output only the best-matching Galleria product for our 1 T&T product.
Setting up the Fuzzy Match transform is so tricky that it is the main reason I wrote this post.
Here is how to navigate the Settings:
Lookup transform – this is the source step that contains the long list of potential items that we’re trying to match our 1 value with. In our case, this is the list of Galleria products with UPCs.
Lookup field – the specific field that we’ll fuzzy-match against. This is coming from the step in the “lookup transform”. In our case, it is our Galleria “product name & unit size” composite value.
Main stream field – this is the 1 main product we’re working with. In our case, it is the T&T Product that comes from the variable currentProduct. (Note that we have a special “fetch variable” step that has turned the variable into a “field” that we use here)
Algorithm – the fuzzy-matching algorithm we’ll use to find the best approximate match between our 1 T&T Product (in any given run of the child pipeline) and the many Galleria Products that we’re comparing against. The algorithms are listed and explained in the Apache Hop docs for Fuzzy Match. Your options are:
Levenshtein
Damerau Levenshtein
Needleman Wunsch
Jaro
Jaro Winkler
Pair letters similarity
Metaphone
Double Metaphone
SoundEx
Refined SoundEx
Get closer value – when this checkbox is TRUE, the fuzzy match will only get the closest matching product name from the available options. If the checkbox is FALSE, it’ll return a list of all products that have a similarity above the threshold you set, concatenated with the “Values separator” you provide.
Minimal value – the minimum similarity between 2 product names that you’ll consider a “match”. Remember that a value of 1 means “The 2 strings are identical” and less-than-one loosens up the conditions for what qualifies as a “match”.
My advice about fuzzy matching algorithms: read how the different algorithms work, and pick those that are applicable for your use case. Then, use data from your actual dataset to try out different algorithms and match thresholds. This will let you “dial in” on the settings that give you an acceptable number of false positives / false negatives.
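Before committing to a threshold, it can help to compute raw similarity scores on a few pairs from your own data. Here’s a rough, dependency-free Python sketch of plain Jaro similarity (the product strings are invented for illustration, and Hop’s own implementation may differ in small details):

```python
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity: 0.0 = nothing in common, 1.0 = identical."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    s1_flags = [False] * len(s1)
    s2_flags = [False] * len(s2)
    matches = 0
    # Count characters that match within the allowed window.
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(i + window + 1, len(s2))
        for j in range(lo, hi):
            if not s2_flags[j] and s2[j] == c:
                s1_flags[i] = s2_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among the matched characters.
    transpositions, k = 0, 0
    for i, c in enumerate(s1):
        if s1_flags[i]:
            while not s2_flags[k]:
                k += 1
            if c != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

threshold = 0.78  # the "Minimal value" from the transform settings
pairs = [
    # invented example strings in the "name@unit" format used in this post
    ("aroy-d canned jackfruit@565g", "aroyd jackfruit in brine@565g"),
    ("aroy-d canned jackfruit@565g", "coconut milk@400ml"),
]
for a, b in pairs:
    score = jaro(a, b)
    print(f"{score:.3f}  {'match' if score >= threshold else 'no match'}  {a!r} vs {b!r}")
```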
Fuzzy matching will never be 100% accurate. You’ll probably need to do a manual quality check after your run.
Here is the second settings window and it has to do with the output values:
Match field – this is the name for the output field that will contain the actual value that was a match. In our case, it’ll be the “matchagainst” value from the Galleria values.
Value field – this field will contain the numerical “match quality” for the algorithm you pick. (see below)
Get fields – if you want to get additional fields from the “Lookup transform” then you need to click “Get fields” and include them in the output from this transform. In our case, these are the columns from the Galleria data – we’re looking for that “upc” column so that is the most critical field to have in that list.
Here’s a real example of the output from a fuzzy match step:
“currentProduct” is the T&T Product name we fed into the step. It was the same value for each of the Galleria products we compared against.
“found” is the field that actually matched on the Galleria side. Note that it is a different value from “currentProduct” but it is similar enough to match using the Jaro algorithm. The fuzzy match was successful!
“match strength” indicates how strong the match is. Here, it is a 0.885-quality match – above the 0.78 threshold in the screenshot but lower than a “1” (identical value) match.
All the fields here were pulled from the same row in the Galleria CSV as the product that fuzzy-matched.
4. Get the T&T SKU
This step is another “Get Variables” transform that grabs the T&T internal product ID (the “SKU”) from the parent pipeline and puts it into the output stream after our Fuzzy Match.
5. “Copy rows to result”
The last step is “Copy rows to result”. It’ll pass all our “child’s” data to the parent pipeline: the T&T Product identifier “currentProduct”, the T&T SKU (a T&T-specific product identifier) and all the data for the matching Galleria product.
Because our Fuzzy Match transform is only fetching the 1 closest match for a given T&T product, we are also passing just 1 row of results to the parent pipeline.
At this point you can click the ▶️ icon on the child pipeline to run the fuzzy match against our 1 default product that’s set as a variable.
Parent pipeline
Let’s look at the “parent pipeline” that iterates through every T&T product and feeds them into the fuzzy match “child pipeline”. The Parent pipeline is in the file parent-pipeline.hpl and it looks like this:
What it does is read the T&T products from a CSV file; sends each product to the child fuzzymatch-each-product.hpl pipeline; and writes the output of all the matches into another CSV file.
The trickiest parts are in the middle Pipeline executor transform. In the settings, I am setting up 2 variables – “varCurrentProduct” and “varCurrentSKU” that will contain the T&T product name (from the .csv input field “matchagainst”) and the T&T SKU. These variables will be available to the Child pipeline through Variables:
The Child Pipeline’s values for these variables will change on every execution, with every row of T&T data.
Additionally, when you SHIFT-drag from this “pipeline executor” step to connect it to the CSV file writer, you’ll see 5 different options for the kind of data you want in that stream. Choose “This output will contain the result rows after execution”.
The different output types result in the values below being written to disk:
Final result
You perform the entire fuzzy match for all products by clicking the ▶️ icon on the Parent Pipeline parent-pipeline.hpl
I ran the Jaro fuzzy-match with a match threshold of 0.78 and here are the results I got. Successful matches are marked in green, false-positives in red:
You can see that, for example, “red curry” on the T&T product and “coconut milk” on the Galleria side are considered a successful match. This is too loose – I don’t want matches like these. So we need to raise the number for our match threshold.
Most of the accurate matches in green have a “match strength” value above 0.86 so that’s going to be the new “minimal value” for the Fuzzy Match step. The re-run results look like this:
Much better!
But there are still a couple of products that aren’t exactly the same. We can catch this with a manual review…
Tips
Apache Hop has a very quirky user interface and an odd mental model. Here are a few tips as you adapt my sample files to your use case:
Error reporting
Use the “Metrics” and “Logging” tabs at the bottom of the canvas to see what went wrong with your pipeline run. Error messages will appear in “Logging”.
Detail-level for error messages
If you need a greater level of detail in your error messages, when you do a “full run” you’ll be prompted to set the granularity of the log messages:
Data snapshots at each stage of the pipeline
In every successful run, you’ll see these “data table” icons appear next to many Transforms on your canvas. Left click on that tiny icon to see an immediate preview of the actual data at that step. (If your step doesn’t show this little icon, you can add a “Dummy”-type transform that’ll allow you to see data at that stage)
If your pipeline breaks after changing a CSV file
If your working pipeline stops working after you make a change to a CSV file, then you often need to refresh the list of fields in Hop so it reflects the latest fields in the CSV.
Go to your CSV file input transform and click “Get Fields” to refresh the list. Remember to also go to your Fuzzy Matching transform, go to the second tab that says “Fields” and click “Get Fields” there too!
Dear reader: if you are comfortable with Apache Hop and have spotted a mistake – please get in touch with me at “jacob” at this website – or write your own post and I’ll gladly link to you.
A word about “Fuzzy Grouping”
Fuzzy Grouping is where you have a set of values, and you find ones that are similar to each other through a fuzzy comparison. Fuzzy Matching compares 1 value to 1 other value. Fuzzy Grouping discovers groupings of similar values without your knowing the proper groupings ahead of time.
In our grocery example, this would be like getting product names from 7 vendors and making sure that the same products (with a slightly different name) are properly grouped together.
The basic way to fuzzy group in Apache Hop is to combine all 7 vendors’ products into 1 CSV, run it through the parent and run it through the child – so you get every product trying to fuzzy match with every other product across all vendors.
There are certain filtering steps that you can perform to reduce the repetitive work (ex. don’t compare a T&T product against other T&T products).
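If you want to prototype that idea outside of Hop, a rough Python sketch of the pairwise approach could look like this (I’m using the standard library’s SequenceMatcher purely as a stand-in for whichever fuzzy algorithm you’d actually pick, and the product rows are invented):

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical combined list of (vendor, matchagainst) rows from several vendors.
products = [
    ("tnt",      "aroy-d canned jackfruit@565g"),
    ("galleria", "aroyd jackfruit in brine@565g"),
    ("galleria", "coconut milk@400ml"),
    ("vendor3",  "jackfruit, canned@565g"),
]

threshold = 0.6   # tune against your own data
groups = []       # each group is a set of (vendor, name) tuples

for (v1, n1), (v2, n2) in combinations(products, 2):
    if v1 == v2:
        continue  # skip same-vendor comparisons to reduce the repetitive work
    if SequenceMatcher(None, n1, n2).ratio() < threshold:
        continue
    # Naive grouping: attach the pair to an existing group if either item is already in one.
    for group in groups:
        if (v1, n1) in group or (v2, n2) in group:
            group.update({(v1, n1), (v2, n2)})
            break
    else:
        groups.append({(v1, n1), (v2, n2)})

for group in groups:
    print(sorted(group))
```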
Here is my example of a fuzzy grouping setup:
Each vendor’s data is passed through a separate fuzzy match step that operates on the 1 “input product” from the “get variables” transform.
As data flows into the pipeline from the CSV, it is split into separate vendor paths (“Vendor switch” on the left) through a “Switch / case” transform with the following settings:
And the final output looks like this:
You can see how different vendors’ products (the “match” column) are matched up to the 1 product fed into the process (“myprod”). Vendors without a matching product name result in a <null> row.
It was the summer of 2005 and the Tourism Toronto agency had just launched a new “city brand” campaign with a new logo, tagline and ad campaign for the City of Toronto.
The campaign was bland and instantly forgettable. The big bold vision of a city of insurance adjusters:
The new initiative cost $4 million and was jointly created by TBWA Toronto and Brand Architecture International, based in New York. (Both part of Omnicom).
For many years Toronto has been a best kept secret tourist destination; a result of its people’s modest character. But the media and demanding travelers have discovered this cultural Mecca and are beginning to spread the word.
And a sleep-inducing TV ad:
“A city that is forever unfolding” indeed
The big international reveal came on June 26, 2005 with a 2-page ad in the New York Times. The ad had typos (“For seventeen years the Beaches International Jazz Festival fills the city of Toronto with jazz”) and promoted cultural events that had already finished by the date of publication.
The June 26 NYT Ad
Dec. 20, 2025 update: I got a copy of the infamous ad from the Toronto Reference Library’s microfilm archive!
Here it is. Click on the images to see full-size and read the text. It’s… “somehow majestic”:
This ad doesn’t stand out in the June 26th issue. The film “War of the Worlds” had a much bigger presence with more ads, and flashier ones.
In Toronto’s defence, there was an even clunkier ad for “Canada” in that issue:
Is this supposed to be a folded photograph? The dust cover of your own hardcover book?
Sludge-like advertising copy referred to Toronto as “a product of natural occurrences.”
Toronto Star, Jul. 23, 2005
The terrible new brand drove Toronto designer Errol Soldanha to create the website “Toronto Limited” to discuss aspects of the campaign. What really stung for Soldanha and others was that a New York agency was hired to develop Toronto’s public identity. Were there no Toronto agencies that could do the job?
They look like ads for a pharmaceutical company, selling a variety of prescription antidepressants.
nasty font – I first read that as Loronto (for some reason), then Joronto then Toronto finally.
It’s simplicity and lack of real explanations encourages people to find out more. I don’t think what Toronto is or the essence of Toronto can be shown to people, it has to be found.
(This one is genuinely sweet. That’s just not how advertising is supposed to work.)
The logos remind me of 1960’s municipal architecture.
Flash forward 19 years after the “very Toronto” misadventure of Toronto Unlimited:
I was chatting with a colleague (hi JB!). He was about to travel to Toronto for our on-site and he left sightseeing plans until the last moment. I joked that our city motto is “Toronto: It Sneaks Up On Ya!“
And then I remembered – we had a real-life motto that was just as bad. “Toronto: unlimited”. But I couldn’t find an example of the logo. Nowhere in Google Images or Google search. Did I dream the whole thing? Am I actually an Alzheimer-addled Uzbekistani septuagenarian who’s hallucinated a whole life as a Torontonian? (JB, if you can hear me, send help!)
It felt like the whole episode was wiped from the Internet
I put my sleuthing skills to good use and compiled what archived data was available into this page. So that everyone may know: no matter how badly you messed up at work today, it’s still nowhere near as bad as
P.S. I got a kick from knowing that, when it comes to Toronto Unlimited, I’m on the same page as JT Singh – the mastermind behind Pyongyang’s tourism brand.
When it comes to the Branding world, the people doing Tourism Branding are wading in the kiddie pool compared to the people doing Pepsi’s “Grinning Idiot Rebrand” – who are plunging into the ocean’s abyss.
In August of 2024 I discovered a treasure trove of Artistic Printing from the Victorian age – the “Printers’ International Specimen Exchange” – and created a central page with every known image from it. Join me for a deeper look at the prints from that book!
Fun fact about Kwisatz Haderach: aside from being a legendary figure in Frank Herbert’s Dune, it actually means “shrinking of the path” in Hebrew. It refers to an ability attributed to a Hasidic Tzadik (holy man): shrinking what is normally a three-day journey into just one day of travel.
What does that even mean when you’re in Victorian times? Well… it looks a lot like working at a printing press.
If you are a respectable person, you’d be a member of a printers’ guild. Printing books. Boring, reliable and informed by 400 years of tradition. After all, a single printed book transmits ideas down the generations.
Taylor shared a link to the online book Design for Real Life – which addresses how to design digital experiences that work for stressful, sensitive and urgent situations.
They’re getting cute and clever. But the context is that they fucked up. That’s not the time to joke around – it’s the time to be contrite and make it right.
My table structure and SELECT query were very simple. The database was going to be used for read-only access, without worrying too much about the integrity of the SQLite file (as it was re-created each day).
The database consisted of:
A “product” list with product name, vendor and brand. 120 thousand rows.
A “raw” time-series table representing the daily price for each product_id. 9 million rows.
A typical SQL query is below; there was an index on raw.product_id.
SELECT nowtime, vendor, product_id, product_name, brand, current_price, old_price, units, price_per_unit, other
FROM raw
INNER JOIN product
ON raw.product_id = product.id
WHERE product.product_name LIKE '%bread%' OR product.brand LIKE '%bread%';
The problem
On my shared hosting, the query took a whopping 40 seconds to return its results on the first run. If you re-ran the query, it took only 8 seconds on the server. On my personal laptop, the same query took 198 milliseconds to return 215,000 rows.
Why was the first-time run so slow?
Spoiler: eventually I decreased the runtime of the SELECT + result fetching operation from 40 seconds to 13 seconds with the following pragmas. However, most of the speedup was due to rearranging row data within the SQLite file.
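The exact block isn’t reproduced from my script here, but in Python the combination looks roughly like this (the file name is a placeholder; each pragma is discussed in the sections below):

```python
import sqlite3

conn = sqlite3.connect("products.db")  # placeholder file name

# The mix of read-speed settings discussed in the rest of this post.
conn.execute("PRAGMA query_only = TRUE;")      # clearer errors on accidental writes, not a speed win
conn.execute("PRAGMA journal_mode = OFF;")     # no rollback journal for this read-only file
conn.execute("PRAGMA synchronous = OFF;")      # don't wait for the OS to confirm writes
conn.execute("PRAGMA temp_store = 2;")         # keep temporary tables/indices in memory
conn.execute("PRAGMA mmap_size = 268435456;")  # memory-mapped IO (256 MB) – results varied for me
```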
Note: I believe the query_only pragma only helps with proper error reporting, rather than speed of execution. But I wanted to include it here just in case.
How to speed up SQLite reads
The problem turned out to be related to the slow hard-drive on my web host, and a phenomenon called a “cold start”.
Concept: the “cold start”
As discussed in this thread, the first time SQLite runs, it starts from a “blank slate” – none of the file data has been loaded into memory yet. So SQLite depends on the operating system to load content from disk into memory.
On the second run, all of that file content from before is still resident in the operating system’s memory cache, so subsequent queries are much faster than the first. This explains why the first time I ran a query it took 40s, and the second run only took 8s.
The file’s contents go out of the memory cache after some time without access. If you are using CPanel, this phenomenon will appear in the “Resource Usage” page like the picture below. The yellow arrows highlight when an SQLite query is run – the OS loads a big portion of the file into memory and then gradually frees up that memory.
On my laptop’s SSD, the average speed is 882.6MB/sec – 58 times faster than the server.
This difference in hard drive speed explains why the query took so long to return the results on the server: physically seeking and retrieving the result rows took more time because of how slow the server disk was.
This is worth repeating: from benchmarking my application, I could see that the SELECT query itself was completing quickly (thanks, indexes!) – but the fetching of the result rows was very time-consuming.
Note: your index will take up extra space in the SQLite file. It is like its own table. You can use the sqlite3_analyzer.exe utility program that comes with SQLite to see just how much space the index takes up. Note how an index on an INT field of the “raw” table with 8 million rows takes up 16.8% of the entire file by size:
*** Page counts for all tables and indices separately ***
After you’re done creating your index, run the ANALYZE SQL statement. This updates the statistics that the query planner uses when deciding how to look up data. I found that SQLite will quietly ignore the index if you don’t run ANALYZE.
Q: How do I know that SQLite was ignoring the index? A: Run EXPLAIN QUERY PLAN (also addressed in Jason’s writeup about indexes)
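Putting the index, ANALYZE and EXPLAIN QUERY PLAN together, a minimal sketch looks like this (the index name is my own; the table and column names follow the schema described above):

```python
import sqlite3

conn = sqlite3.connect("products.db")  # placeholder file name

# Index the join column, then refresh the statistics the query planner uses.
conn.execute("CREATE INDEX IF NOT EXISTS idx_raw_product_id ON raw(product_id);")
conn.execute("ANALYZE;")
conn.commit()

# Check that the planner actually uses the index for the join.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM raw
    INNER JOIN product ON raw.product_id = product.id
    WHERE product.product_name LIKE '%bread%' OR product.brand LIKE '%bread%';
""").fetchall()
for row in plan:
    print(row)  # look for something like "SEARCH raw USING INDEX idx_raw_product_id"
```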
What worked for me: rearranging your data on the disk
What ultimately sped up my queries was figuring out in what order SQLite will fetch the result rows for my queries, and rearranging the data in the file itself so that result rows are likely to be near each other. (This is similar to creating a “Clustered Index” on other SQL engines, but SQLite doesn’t support this feature natively.)
Let me get more specific:
The query I was optimizing was looking for products that contained the word “bread” in their name or brand. Then, it would return any entries from the “raw” table that matched the same products.
The problem with the “raw” table was that it was arranged in chronological order – sorted by the date stamp by default.
Imagine you are the SQLite engine, trying to fetch all rows for product ID 1554. You would use the index to find out where in the database file all those entries are located. The first row might have been recorded on June 5th, and would sit in the middle of a pack of other product rows recorded that day. To get that one row of data, you’d read an entire “page” (4,096 bytes) from the file into memory. After that, the next entry for product 1554 would be found in the middle of a pack of rows recorded on June 6th. You’d fetch a whole bunch of data to get at that 1 row… and repeat and repeat and repeat.
The solution that worked for me was to position all the rows for “product 1554” near each other – to stop storing them in chronological order on disk. That way, when you read in the first product entry, that same block of 4,096 bytes will also contain the next several rows that you’re looking to fetch.
The runtime went from 40 seconds to 17 seconds on the first query.
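SQLite can’t do this with a clustered index, so the rearranging happens by rebuilding the table. Here’s a sketch of the general approach – not necessarily the exact commands I ran; the names follow the schema above, and in practice CREATE TABLE … AS SELECT writes the rows in the SELECT order:

```python
import sqlite3

# isolation_level=None keeps the connection in autocommit mode, which VACUUM requires.
conn = sqlite3.connect("products.db", isolation_level=None)  # placeholder file name

conn.executescript("""
    -- Rebuild "raw" with its rows physically ordered by product_id
    -- instead of by insertion (chronological) order.
    CREATE TABLE raw_clustered AS SELECT * FROM raw ORDER BY product_id;
    DROP TABLE raw;
    ALTER TABLE raw_clustered RENAME TO raw;

    -- Recreate the index and refresh the planner statistics for the rebuilt table.
    CREATE INDEX idx_raw_product_id ON raw(product_id);
    ANALYZE;
""")
conn.execute("VACUUM;")  # reclaim the space freed by dropping the old copy of the table
```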
Bonus: how many rows fit within one 4,096-byte “page” of data? The sqlite3_analyzer.exe program can tell us. For the “raw” table, it shows that an average row takes up 48.48 bytes on disk. So roughly 84 rows are grabbed in a single disk “read” operation.
Rearrange the column order within a row
If you are storing large amounts of data in each row – close to the “page” size of 4 KB – then you might benefit from rearranging the column order within your table definition.
That’s because you might be hitting a situation where fetching a single row requires more than one disk read operation. In those cases, a second (or third) read operation happens in order to get the full record. If your row contains several short “metadata” columns and a big “content” column, then it’s better to have those short fields defined first – an SQL query might get the needed metadata from the first “page” and avoid reading all the pages that make up one row.
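As a sketch of the idea (hypothetical table and column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Short, frequently-queried "metadata" columns first; the large text column last.
# A query that only needs the metadata can then usually be answered from the first
# page of the row, without reading the overflow pages holding "content".
conn.execute("""
    CREATE TABLE article (
        id         INTEGER PRIMARY KEY,
        title      TEXT,
        author     TEXT,
        created_at TEXT,
        content    TEXT   -- the big column goes last
    );
""")
```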
The page size is a kind of “block” unit in which SQLite stores rows within its file – just like your filesystem’s block size. My understanding is that the SQLite page size should be equal to your filesystem’s block size.
To figure out the filesystem block size, try running stat -fc %s .
If you’re creating the SQLite file on one operating system but using it on a server that runs another, there’s a chance that they could have different default block sizes. In that case, you’ll need to set the page_size to match your target system.
The page_size pragma applies at DB creation time. When running it on an existing database, you need to run VACUUM immediately afterwards for the change to take effect.
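A minimal sketch of changing it on an existing file (4096 here is just an example; use whatever block size stat reported for your target filesystem):

```python
import sqlite3

# Autocommit mode so the VACUUM can run outside a transaction.
conn = sqlite3.connect("products.db", isolation_level=None)  # placeholder file name
conn.execute("PRAGMA page_size = 4096;")  # the new size only applies to newly written pages...
conn.execute("VACUUM;")                   # ...so rebuild the file right away
print(conn.execute("PRAGMA page_size;").fetchone())  # confirm the change took effect
```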
Try: opening the SQLite file in immutable mode
When you open an SQLite file, you can specify a variety of “URI parameters” that change the way the file is treated. In Python, this looks like:
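(A minimal sketch – the file name is a placeholder for whatever your app opens.)

```python
import sqlite3

# The "file:" URI form lets you pass parameters such as mode and immutable.
conn = sqlite3.connect(
    "file:products.db?mode=ro&immutable=1",  # read-only, and nothing else will write to it
    uri=True,
)
```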
If your app uses the SQLite file for read only operations, you can try opening the file with the immutable=1 parameter. “Immutable” means that your app will not modify the SQLite file and assumes that neither will any other process. Immutable mode will cause the file to be opened in read-only mode, and SQLite will not attempt to lock the file when accessing – which will speed things up if your app is under heavy use. If, for some reason, someone does write to the DB file, it’ll cause an error or corruption.
SQLite creates temporary tables, indices, triggers, and views as part of regular query execution. If your DB is configured to write these to disk, you can set PRAGMA temp_store=2 to explicitly create them in memory only.
SQLite has a “rollback journal” – a journal is a special file that’s used to undo partial transactions during a crash, and prevents database corruption. By default, SQLite creates a separate file for this journal.
For my read-only / high-tolerance for DB corruption use case, it made sense to turn off journaling completely with the PRAGMA journal_mode=OFF statement. This shaved another second from my sample query’s runtime.
It sounds like this setting only affects write operations. But, my understanding is that any journaling transactions – even “read” transactions – will cause a write to disk. I think it’s implicit in this transaction doc, when it says:
While a read transaction is active, any changes to the database that are implemented by separate database connections will not be seen by the database connection that started the read transaction.
You’d turn off synchronous mode with PRAGMA synchronous=OFF. This setting stands to significantly speed up writes to your SQLite file, but at the risk of DB corruption (e.g., after an OS-level crash or a power outage).
Try: memory-mapped IO
Memory-mapped IO is a technique that maps your SQLite file into memory, so that regular file operations on it become plain memory reads (which is much faster). Behind the scenes, this eliminates a memory-copy operation when reading. When writing, it doesn’t save you any cycles.
Memory mapping can be turned on with the below pragma (set it to a large number of bytes, and the OS will allocate just whatever is available):
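(268435456 below is just an example value of 256 MB; my understanding is that it acts as an upper limit rather than a hard allocation.)

```python
import sqlite3

conn = sqlite3.connect("products.db")  # placeholder file name
# Ask SQLite to memory-map up to ~256 MB of the database file.
conn.execute("PRAGMA mmap_size = 268435456;")
```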
For my use case, the impact of memory mapping was inconclusive – it either saved 3 seconds or added ~10 seconds to the runtime. I think it’s because mmap still requires reading a lot of data from the slow disk. Test it out on your own system. I ended up using it as part of my final “winning mix” of pragmas.
The sync mode for SQLite determines if SQLite makes sure that data/transactions have been actually written to disk before proceeding with the next operation. The options are described here.
With a read-only use case, I felt comfortable turning off sync completely:
PRAGMA synchronous=OFF
This hands off any “writes” to the operating system without waiting for them to complete, and should speed things up considerably. Note that even read operations may end up writing to the filesystem.
(A caveat from my testing: when I tried this, the runtime actually went up by about 10 seconds, which suggests the earlier mmap measurement was off – maybe some data was still sitting in the disk cache. It’s worth also testing with memory mapping turned OFF, since in some runs it seemed to add a lot of execution time.)
If you are doing heavy LIKE searches, look into the case-sensitive GLOB alternative and the COLLATE NOCASE keyword you can use in column definitions / individual queries. For case-insensitive matches, lowercasing your data and doing a case-sensitive GLOB match will probably be faster than using LIKE with mixed-case values.
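A tiny sketch of the two approaches (hypothetical data; note that GLOB uses * instead of %):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, name_lower TEXT);")
conn.execute("INSERT INTO product VALUES ('Whole Wheat Bread', 'whole wheat bread');")

# Case-insensitive LIKE against the mixed-case column...
print(conn.execute("SELECT name FROM product WHERE name LIKE '%bread%';").fetchall())

# ...versus a case-sensitive GLOB against a pre-lowercased copy of the data,
# which is the approach suggested above.
print(conn.execute("SELECT name FROM product WHERE name_lower GLOB '*bread*';").fetchall())
```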
Reminder: time your code
If your Python application is slow, use the time.time() function to determine how long different parts of your code take to run.
You might be looking in the wrong place for those slow-code areas.
At first, I was trying to figure out why csv.writerows was so slow on the first Flask run. Then I thought the problem was with io.StringIO(), and tried to find a way to preallocate memory for it. Finally, I thought the csv.writerows() code was too slow and started comparing Python’s csv module performance to writing CSVs from a DataFrame.
Eventually, by timing different portions of my code and removing the “csv conversion” code (while keeping the SQLite fetchall()), I narrowed the slowness down to SQLite’s row-fetching functionality.
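A minimal timing helper for that kind of narrowing-down might look like this (the commented-out usage lines assume your own connection and query):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print how long it took, and pass its result through."""
    start = time.time()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.time() - start:.3f}s")
    return result

# Example usage against the query in this post (assumes an open sqlite3 connection):
# cursor = timed("execute",  conn.execute, SQL_QUERY)
# rows   = timed("fetchall", cursor.fetchall)
```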
Bonus: VACUUM
Is your SQLite file big and chunky?
Does it need to go on a diet?
SQLite has a behaviour that keeps your database file growing until you run the VACUUM command:
Frequent inserts, updates, and deletes can cause the database file to become fragmented – where data for a single table or index is scattered around the database file.
When content is deleted from an SQLite database, the content is not usually erased but rather the space used to hold the content is marked as being available for reuse.
Once you VACUUM your file, it is smaller, and reading from it should be faster.
Did this help with your project? If so, drop me a line at jacob at this website to give me the warm fuzzies!