Most recommendation engines just show bestsellers and call it personalization
Updated April 10, 2026 · 5 min read
Many recommendation engines just show you bestsellers and call it personalization. They rank products by total sales or total views, occasionally filter by category, and serve the same list to every visitor. That isn't a recommendation. It's a popularity contest with a UI on top, and calling it personalized doesn't change what's underneath. The visitor who spent ten minutes browsing leather bags gets the same list as the visitor who looked at wallets for thirty seconds, because the underlying ranking has nothing to do with the visitor's session — it has to do with the all-time totals on the catalog. The personalization label is doing a lot of marketing work that the engine itself isn't doing.
Every Shopify store generates behavioral signals that most tools ignore completely. Visitor A views product X, then later buys product Y. Visitor B views the same product X and also buys Y. Run that pattern across thousands of sessions and a co-occurrence signal emerges between X and Y — products that don't necessarily look related in the catalog (different category, different keywords, different photographs) but are clearly connected in how people actually shop. The signal is sitting there in every store's analytics, available for any tool that bothers to compute it, and the tools that don't compute it are leaving the most useful information about visitor behavior on the floor.
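A minimal sketch of what that aggregation looks like, assuming a flat list of session events with hypothetical field names; nothing here is the exact schema any particular tool uses:

```python
from collections import Counter, defaultdict

# Hypothetical event rows: (session_id, event_type, product_id).
# The field names and the tiny in-memory list are illustrative
# assumptions; a real store would pull this from its analytics warehouse.
events = [
    ("s1", "view", "leather-bag"), ("s1", "purchase", "card-wallet"),
    ("s2", "view", "leather-bag"), ("s2", "purchase", "card-wallet"),
    ("s3", "view", "phone-case"),  ("s3", "purchase", "screen-protector"),
]

# Group what each session viewed and what it went on to buy.
viewed, bought = defaultdict(set), defaultdict(set)
for session_id, event_type, product_id in events:
    (viewed if event_type == "view" else bought)[session_id].add(product_id)

# Count how often "viewed X, bought Y" shows up across sessions.
pair_counts = Counter()
for session_id, views in viewed.items():
    for x in views:
        for y in bought[session_id]:
            if x != y:
                pair_counts[(x, y)] += 1

print(pair_counts.most_common(2))
# [(('leather-bag', 'card-wallet'), 2), (('phone-case', 'screen-protector'), 1)]
# The X-Y connection only becomes visible once sessions are aggregated.
```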
The compute side of this used to be a real obstacle. Building a behavioral recommendation system in 2015 meant standing up infrastructure for event ingestion, batch processing, model training, serving, monitoring — most of a small data team's quarter, and the result was a system that needed continuous attention. In 2026, the infrastructure cost has collapsed. BigQuery handles the event warehouse essentially for free at small-store volumes. The math behind co-occurrence signals is well-understood and runs on commodity hardware. Pre-trained models for content similarity are available off the shelf. Standing up a behavioral recommendation engine for an individual store is a weekend of work for a competent engineer. Not building one when the data is right there is a small crime against the storefront's revenue.
A useful behavioral recommendation system blends multiple signal types because no single signal captures the whole picture of what visitors are doing. The four that matter, ranked by how predictive they tend to be in practice:
Co-views carry the most weight in the blend, around 0.45 of the score. Two products that get viewed in the same session are almost always part of the same shopping consideration set, even if the visitor didn't click into one or buy either. Co-views are noisy at the individual session level — a single visitor's path through the catalog might be random — but they're remarkably predictive in aggregate, because shopping sessions tend to organize around a category or a use case and the views within a session reflect that organization. The co-view signal captures the comparison set the visitor was working with.
Co-clicks carry a weight of around 0.30. The product card click is a stronger signal than the page view because it indicates the visitor was interested enough in the product to investigate further. Two products that get clicked in the same session reflect a tighter comparison than two that just got viewed in passing. Co-clicks also handle a failure mode of co-views well: a visitor who scrolled past a category page and triggered a hundred low-attention impressions will produce a lot of co-view noise, but the co-click data only counts the products that pulled enough interest to merit a click.
Content similarity carries a weight of around 0.20. This is the only signal that doesn't come from behavioral data — it comes from the products themselves, embedded into a vector space using a transformer model, with relatedness measured as cosine similarity between embeddings (a short sketch of that computation follows this rundown). Content similarity is the safety net for cold-start problems. New products that don't have behavioral data yet get reasonable recommendations through similarity to the existing catalog. Niche products that don't co-occur with much get recommendations through what they look like and what their descriptions say. The content signal isn't as predictive as the behavioral signals when behavioral signals exist, but it stops the system from collapsing on the long tail of the catalog.
Co-purchases carry a weight of around 0.05, the smallest of the four. The signal is high-quality — a co-purchase is the strongest evidence that two products belong together — but it's also the rarest, because most sessions don't end in a purchase, and most purchases are single-item. Treating co-purchase as a strong-but-rare signal in the blend keeps it from dominating the ranking when it does fire, while still letting it nudge the score upward for product pairs that demonstrably end in the same cart.
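Here is the content-similarity sketch promised above. It assumes an off-the-shelf sentence-transformers embedder; the library, the model name, and the toy catalog are illustrative choices, not the specific stack behind any particular engine:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # any off-the-shelf embedder works

# Hypothetical catalog text per product (title + description).
catalog = {
    "leather-bag": "Full-grain leather tote bag with brass hardware",
    "card-wallet": "Slim leather card wallet with six slots",
    "phone-case":  "Shockproof silicone phone case for daily carry",
}

model = SentenceTransformer("all-MiniLM-L6-v2")         # assumed model choice
ids = list(catalog)
emb = model.encode([catalog[i] for i in ids])            # shape: (n_products, dim)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize each row

sim = emb @ emb.T  # cosine similarity between every pair of products
for a, b in [("leather-bag", "card-wallet"), ("leather-bag", "phone-case")]:
    print(f"{a} vs {b}: {sim[ids.index(a), ids.index(b)]:.3f}")
# The two leather products should score noticeably closer than the bag and
# the phone case, which is all the cold-start safety net needs.
```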
There is no machine learning model behind the blend. The four signals are computed independently, each one a count or a similarity score, and they're combined with the explicit weights above into a single per-pair score. A nightly batch refresh recomputes the signals from the previous window of behavioral data. The recommendation set for any given source product is the top N other products by blended score, with optional merchant overrides to boost or suppress specific items.
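A minimal sketch of that blend, assuming the four per-pair signals have already been counted and normalized to a 0-1 range; the dict layout, the normalization, and the override handling are illustrative assumptions rather than any tool's actual data model:

```python
# Explicit weights from the rundown above.
WEIGHTS = {"co_view": 0.45, "co_click": 0.30, "content": 0.20, "co_purchase": 0.05}

def blended_score(signals: dict[str, float]) -> float:
    """Weighted sum of the four normalized signals for one product pair."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

def recommend(source_id, pair_signals, overrides=None, top_n=4):
    """Top-N candidates for one source product by blended score.

    pair_signals: {candidate_id: {"co_view": 0.8, "co_click": 0.5, ...}}
    overrides:    {candidate_id: "boost" | "suppress"} set by the merchant.
    """
    overrides = overrides or {}
    scored = []
    for candidate_id, signals in pair_signals.items():
        if overrides.get(candidate_id) == "suppress":
            continue  # merchant removed this item from consideration
        score = blended_score(signals)
        if overrides.get(candidate_id) == "boost":
            score += 1.0  # push boosted items above any organic score
        scored.append((score, candidate_id))
    return [cid for _, cid in sorted(scored, reverse=True)[:top_n]]

# Example: one source product with three candidates, one suppressed.
pairs = {
    "card-wallet": {"co_view": 0.9, "co_click": 0.7, "content": 0.6, "co_purchase": 0.3},
    "key-fob":     {"co_view": 0.4, "co_click": 0.2, "content": 0.5},
    "phone-case":  {"co_view": 0.3, "co_click": 0.1, "content": 0.1},
}
print(recommend("leather-bag", pairs, overrides={"phone-case": "suppress"}))
```

Missing signals default to zero, which is exactly the fallback behavior described below: a pair with no co-purchase data still scores on its co-views, and a new product with no behavioral data still scores on content similarity.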
This is co-occurrence math, not predictive modeling. The reason it produces good recommendations isn't that it has learned anything subtle about visitor preferences. It's that the four signals capture different aspects of the relationship between products, and the blend is robust against any one signal being noisy on a particular pair. When the co-purchase signal is sparse, the co-view signal carries the recommendation. When the behavioral signals are sparse for a new product, the content similarity carries it. The system never needs to be smart in a deep-learning sense because the underlying data is already organized by the visitors' actual behavior, and the weights are just the recipe for blending the four views of that data into one ranking.
The temptation to put an ML buzzword on the system is real, because "AI-powered recommendation engine" sells better in a marketing pitch than "co-occurrence math on real shopping behavior." The honest version is more boring and more accurate. Visitors don't care which adjectives describe the engine. They care whether the products on the recovery page look like things they were actually considering, which the blended-signal approach achieves without needing any of the AI vocabulary the category has appropriated.
The end-result difference between a bestseller list and a context-aware recommendation is large and easy to measure. A bestseller list on a recovery page drives maybe 1-2% click-through, depending on the vertical. The same recovery page rendering context-aware recommendations from the same store's behavioral data tends to land in the 8-11% range. That's at least a 4x difference on the same surface, with the same visitor population, the same page layout, and the same store. The only variable is the ranking logic underneath the grid, and that one variable accounts for most of the difference in engagement.
The reason the difference is so large is that the bestseller list is showing every visitor the same products, which means it's mismatched with most of the visitors. The visitor who was looking at leather bags sees the bestseller list and the bestseller list shows phone cases, candles, t-shirts — whatever the store sells the most of overall. The list isn't wrong about what the store's bestsellers are; it's just irrelevant to the visitor's session. Context-aware recommendations show the leather-bag visitor more leather bags, plus the wallets and accessories that other leather-bag visitors actually bought. The grid stops being random with respect to the session and starts being a continuation of the search the visitor was doing.
The system also gets better over time without any explicit training step, because every conversion from a recommendation feeds back into the co-purchase and co-click signals, and every dismissed recommendation feeds back into a slightly lower co-view weight for that pair on the next refresh. The improvement is gradual and bounded — there is no neural network learning to be smarter, just the signals shifting slightly each night based on what the visitors did the day before. But the gradual shift compounds, and stores that have been running the engine for six months end up with rankings that reflect their specific catalog and audience much more tightly than they did at install.
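A hedged sketch of what the dismissal side of that feedback could look like inside the nightly refresh; the penalty factor, the bound, and the function shape are illustrative assumptions, not the exact mechanism. Conversions need no special handling here because they simply show up in the next day's co-click and co-purchase counts:

```python
def adjusted_co_view(raw_co_view_count: float, dismissals: int,
                     penalty: float = 0.02) -> float:
    """Dampen one pair's co-view weight by the dismissals it accumulated yesterday."""
    factor = max(0.5, 1.0 - penalty * dismissals)  # bounded, never below half
    return raw_co_view_count * factor

# Example: a pair with 120 raw co-views and 3 dismissed recommendations.
print(adjusted_co_view(120, dismissals=3))  # 120 * 0.94 = 112.8
```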
The honest summary of how a behavioral recommendation engine works is that it's mostly bookkeeping. Count co-views. Count co-clicks. Count co-purchases. Compute content similarity. Blend the four with weights. Refresh nightly. Serve the top N by blended score. There's no model, no training pipeline, no inference cost worth talking about, no AI badge that belongs on the dashboard. What the system has, that the bestseller-list approach lacks, is the discipline of using the data the store already generates to organize the catalog around what visitors actually do, rather than around what the store has shipped the most of historically. That discipline accounts for most of the difference between a recommendation grid that converts at 2% and one that converts at 10%, and the discipline is the thing the category has been mostly skipping.
Recover missed product discovery.
Free Starter plan. 7-day trial on paid plans. No credit card. Native theme integration. Honest attribution.