
Building a Unified Multimodal Content Engine: A Step-by-Step Guide for Travel Platforms

Posted by u/Fonarow · 2026-05-20 08:25:18

Introduction

In the competitive travel industry, connecting visual content such as hotel images with textual reviews unlocks deeper discovery. Platforms like Agoda have built a multimodal content system that unifies over 700 million images and multilingual guest reviews under a shared topic taxonomy. Such a system lets users search and explore based on both what they see and what other guests have experienced. In this guide, you’ll learn how to build a comparable system, step by step, using offline enrichment and low-latency serving.

Source: www.infoq.com

What You Need

Data sources: A large corpus of hotel images (700M+ scale) and multilingual guest reviews (text in multiple languages).
Topic taxonomy: A curated ontology of travel-related topics (e.g., cleanliness, location, amenities, atmosphere) that can be applied to both images and text.
Computing infrastructure: Cluster with GPU support for training computer vision and NLP models, plus scalable storage (e.g., cloud object storage).
Offline processing pipeline: Tools for enrichment (e.g., Apache Spark or custom ETL) to tag images and reviews with taxonomy topics.
Low-latency serving stack: A vector index or search engine (e.g., Elasticsearch with dense vectors, the FAISS library, or a custom service) for fast multimodal retrieval.
Machine learning models: Pre-trained or fine-tuned models for image classification/object detection and for multilingual NLP (e.g., sentence embeddings).
Engineering team: Skills in data engineering, ML, and backend development.

Step-by-Step Guide

Step 1: Define a Unified Topic Taxonomy

Begin by creating a shared topic taxonomy that bridges images and reviews. This taxonomy should cover key aspects of a hotel experience, such as:

Cleanliness (e.g., tidy rooms, clean bathrooms)
Location (e.g., proximity to attractions, quiet neighborhood)
Service (e.g., friendly staff, check-in efficiency)
Comfort (e.g., bed quality, room size)
Amenities (e.g., pool, breakfast, Wi-Fi)
Atmosphere (e.g., modern design, cozy vibe)

Each topic should be definable in both visual and textual terms. For example, “cleanliness” might be detected in images via object recognition (e.g., tidy bed) and in reviews via keyword extraction (e.g., “spotless”). Use domain experts and A/B testing to validate your taxonomy.
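One way to keep each topic defined in both modalities is to store its visual and textual cues side by side. The following is a minimal sketch, not the structure any particular platform uses; the `Topic` class, the cue lists, and the `TAXONOMY_VERSION` tag are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    """One taxonomy entry with cues for both modalities (all values illustrative)."""
    name: str
    visual_cues: tuple  # labels an image model might emit
    text_cues: tuple    # words or phrases a review might contain


TAXONOMY_VERSION = "v0-example"  # hypothetical version tag

TAXONOMY = {
    t.name: t
    for t in (
        Topic("cleanliness", ("tidy bed", "clean bathroom"), ("spotless", "clean", "tidy")),
        Topic("amenities", ("pool", "breakfast buffet"), ("pool", "breakfast", "wi-fi")),
        Topic("atmosphere", ("modern design", "cozy room"), ("cozy", "modern", "stylish")),
    )
}


def validate(taxonomy):
    """Return the names of topics that lack a visual or a textual definition."""
    return [name for name, t in taxonomy.items() if not t.visual_cues or not t.text_cues]
```

Running `validate(TAXONOMY)` before each release is a cheap check that no topic has drifted into being text-only or image-only.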

Step 2: Ingest and Preprocess Multimodal Data with Offline Enrichment

Aggregate all hotel images and multilingual reviews into a unified data lake. For images, deduplicate and resize to a standard format. For reviews, detect the language, translate if necessary, and tokenize. Then, run an offline enrichment pipeline that applies your taxonomy to each data point:

Image enrichment: Use a pre-trained computer vision model (e.g., ResNet, EfficientNet) fine-tuned to recognize visual indicators of each topic. Generate confidence scores for each topic.
Review enrichment: Employ a multilingual NLP model (e.g., XLM-RoBERTa, LaBSE) to extract topic-relevant phrases and assign topic scores. Additionally, perform sentiment analysis to weight the reviews.

Store enriched metadata alongside raw data for indexing.
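To make the shape of an enriched record concrete, here is a small sketch of the review side. It uses a keyword matcher as a stand-in for the multilingual model, which is a deliberate simplification; `TEXT_CUES`, `score_review`, `enrich`, and the 0.3 threshold are hypothetical names and values, not part of any real pipeline.

```python
import re

# Hypothetical cue lists; a real pipeline would use a fine-tuned multilingual model.
TEXT_CUES = {
    "cleanliness": {"spotless", "clean", "tidy"},
    "amenities": {"pool", "breakfast", "wifi"},
}


def score_review(text):
    """Score each topic as the fraction of its cue words present in the review."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {topic: len(cues & tokens) / len(cues) for topic, cues in TEXT_CUES.items()}


def enrich(review_id, text, threshold=0.3):
    """Build the metadata record that is stored next to the raw review."""
    scores = score_review(text)
    return {
        "review_id": review_id,
        "topic_scores": scores,
        "topics": sorted(t for t, s in scores.items() if s >= threshold),
    }
```

For example, `enrich("r1", "Spotless room and a great pool")` tags the review with both `amenities` and `cleanliness`. The image side would produce a record of the same shape, with scores coming from the vision model instead.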

Step 3: Align Features Across Modalities

To enable cross-modal retrieval, both images and reviews must be represented in a shared embedding space. Train or fine-tune a dual-encoder model such as CLIP on your enriched dataset so that image embeddings and text embeddings for the same topic cluster together. CLIP is itself a two-tower architecture: images and text are encoded separately and trained with a contrastive loss on matching pairs (e.g., images of a pool and reviews mentioning “great pool”). If you train your own towers, the same objective applies. Fusion models such as ViLBERT process an image and its text jointly rather than producing independent embeddings, so they are generally a better fit for reranking a shortlist than for first-stage retrieval. This step ensures that a search for “cozy atmosphere” can retrieve both images of cozy rooms and reviews describing a cozy vibe.
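The contrastive objective mentioned above can be sketched in a few lines. This is a plain-Python illustration of a symmetric InfoNCE-style loss over a batch of matching image-review pairs, written for readability rather than speed; the function names and the temperature of 0.07 are assumptions, and real training would use a deep learning framework.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_diag_nll(logits):
    """Mean negative log-probability of the diagonal entry (the true pair) per row."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; index i in both lists is a matching pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (mean_diag_nll(logits) + mean_diag_nll(transposed))
```

The loss is low when each image is most similar to its own review and high when the pairs are mismatched, which is what pulls same-topic images and text together in the shared space.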

Step 4: Build a Unified Index for Multimodal Retrieval

Construct an index that supports querying across both images and reviews using the shared embedding space. Options include:

Vector index or database: Use a similarity-search library such as FAISS, or a vector database such as Milvus or Weaviate, to index all embeddings. Store topic scores as metadata for filtering.
Elasticsearch with dense vectors: Enable dense retrieval alongside keyword search for hybrid querying.
Custom inverted index: Combine topic scores from both modalities into a single relevance score.

Ensure the index scales to the full corpus (700M images plus the corresponding reviews). Partition it by geography or hotel cluster so that each query searches only a fraction of the data, which reduces latency.
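The interplay of partitioning, metadata filtering, and similarity search can be illustrated with a toy in-memory index. This sketch does exact cosine search and is only a stand-in for a real approximate nearest-neighbor index; the `PartitionedIndex` class, its method names, and the metadata keys are hypothetical.

```python
import math
from collections import defaultdict


class PartitionedIndex:
    """Exact cosine search over per-partition lists; a stand-in for a real ANN index."""

    def __init__(self):
        # partition key -> list of (item_id, unit vector, metadata)
        self._partitions = defaultdict(list)

    def add(self, partition, item_id, vector, metadata):
        norm = math.sqrt(sum(x * x for x in vector))
        self._partitions[partition].append((item_id, [x / norm for x in vector], metadata))

    def search(self, partition, query, k=10, modality=None, min_topic=None):
        """Return the top-k (item_id, score) pairs from one partition.

        modality filters on metadata["modality"]; min_topic is a
        (topic_name, threshold) pair checked against metadata["topic_scores"].
        """
        norm = math.sqrt(sum(x * x for x in query))
        q = [x / norm for x in query]
        hits = []
        for item_id, vec, meta in self._partitions.get(partition, []):
            if modality and meta["modality"] != modality:
                continue
            if min_topic and meta["topic_scores"].get(min_topic[0], 0.0) < min_topic[1]:
                continue
            hits.append((sum(a * b for a, b in zip(q, vec)), item_id))
        hits.sort(reverse=True)
        return [(item_id, round(score, 4)) for score, item_id in hits[:k]]
```

Because images and reviews live in the same partition and the same vector space, a single `search` call can return both, and the optional filters narrow results without a second lookup.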

Step 5: Implement Low-Latency Serving

Deploy a serving layer that handles real-time user queries with sub-second latency. Key considerations:

Caching: Cache frequent query–result pairs using a distributed cache (e.g., Redis).
Query rewriting: Augment user input (e.g., “quiet hotel”) with synonyms and related topics from the taxonomy.
Multimodal fusion: Combine image and review scores per topic using a weighted formula (e.g., 0.5 × image score + 0.5 × review score), and tune the weights against relevance metrics.
Load balancing: Scale horizontally across multiple servers with auto-scaling based on traffic patterns.
Monitoring: Set up dashboards (e.g., Grafana) for latency, throughput, and retrieval quality metrics.
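Three of the considerations above (query rewriting, score fusion, and caching) are small enough to sketch together. The snippet below is illustrative only: `RELATED_TERMS` is a hypothetical synonym map, and `functools.lru_cache` is used as an in-process stand-in for a distributed cache such as Redis.

```python
from functools import lru_cache

# Hypothetical synonym and related-topic map derived from the taxonomy.
RELATED_TERMS = {
    "quiet": ("peaceful", "calm", "location"),
    "cozy": ("warm", "intimate", "atmosphere"),
}


def rewrite_query(query):
    """Append related terms for each known word, preserving order without duplicates."""
    terms = []
    for word in query.lower().split():
        for term in (word, *RELATED_TERMS.get(word, ())):
            if term not in terms:
                terms.append(term)
    return tuple(terms)


def fuse(image_score, review_score, image_weight=0.5):
    """Weighted sum of per-topic scores; 0.5 weights both modalities equally."""
    return image_weight * image_score + (1.0 - image_weight) * review_score


@lru_cache(maxsize=10_000)
def cached_rewrite(query):
    """Memoized rewrite; repeated queries skip the expansion work."""
    return rewrite_query(query)
```

With these definitions, `rewrite_query("quiet hotel")` expands to `("quiet", "peaceful", "calm", "location", "hotel")`. Exposing `image_weight` as a parameter makes it straightforward to tune the balance between modalities from experiment results.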

Step 6: Validate and Iterate

Test your system with real user interactions. Metrics to track:

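Relevance: Use precision and recall at k (e.g., precision at 10 is the share of the top-10 results that match the user’s intended topic; recall at 10 is the share of all relevant items that appear in the top 10).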
Engagement: Click-through rate on multimodal results versus single-modality baselines.
Latency: P99 response time under load.
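The relevance and latency metrics above are simple to compute offline from logged results. The helpers below are a minimal sketch; the function names are illustrative, and `percentile` uses the nearest-rank definition, which is one of several common conventions.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Share of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for r in ranked_ids[:k] if r in relevant_ids) / k


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Share of all relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return sum(1 for r in ranked_ids[:k] if r in relevant_ids) / len(relevant_ids)


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Calling `percentile(latencies_ms, 99)` on response times collected under load gives the P99 figure; with only a handful of samples it simply returns the slowest one, so collect enough requests for the number to be meaningful.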

Continuously update your taxonomy as travel trends change (e.g., post-pandemic emphasis on hygiene). Retrain models periodically with new data and feedback loops (e.g., user clicks as positive signals).

Tips for Success

Start small, scale gradually: Build a proof-of-concept with a subset of topics and smaller data before tackling 700M images.
Leverage pre-trained models: Fine-tuning existing multimodal transformers (like CLIP) requires less data and compute than training from scratch.
Multilingual alignment: Ensure your NLP model covers all languages present in your review corpus; consider using multilingual sentence embeddings (e.g., sentence-transformers/LaBSE).
Offline enrichment is key: Pre-computing topic scores avoids expensive inference during query time.
User feedback loops: Incorporate implicit feedback (clicks, dwell time) to refine topic weighting and model outputs.
Documentation and versioning: Keep clear records of taxonomy versions, model versions, and offline pipeline runs to facilitate debugging and reproduction.
