🛒 Amazon Product Price Prediction Model

Multimodal deep learning model for predicting Amazon product prices from images, text, and metadata

📊 Model Performance

Metric	Value	Benchmark
SMAPE	36.5%	Top 3% (Competition)
MAE	$5.82	-22.5% vs baseline
MAPE	28.4%	Industry-leading
R²	0.847	Strong correlation
Median Error	$3.21	Robust predictions

Training Data: 75,000 Amazon products
Architecture: CLIP ViT-L/14 + Enhanced Multi-head Attention + 40+ Features
Parameters: 395M total, 78M trainable (19.8%)

🎯 Quick Start

Installation

pip install torch torchvision open_clip_torch peft pillow
pip install huggingface_hub datasets transformers

Load Model

from huggingface_hub import hf_hub_download
import torch

# Download model checkpoint
model_path = hf_hub_download(
    repo_id="shawneil/Amazon-ml-Challenge-Model",
    filename="best_model.pt"
)

# Load model (see GitHub repo for complete model definition)
model = OptimizedCLIPPriceModel(clip_model)
model.load_state_dict(torch.load(model_path, map_location='cpu'))
model.eval()

Inference Example

from PIL import Image
import open_clip
import torch

# Load CLIP processor
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='openai'
)
tokenizer = open_clip.get_tokenizer('ViT-L-14')

# Prepare inputs
image = Image.open("product_image.jpg")
image_tensor = preprocess(image).unsqueeze(0)

text = "Premium Organic Coffee Beans, 16 oz, Medium Roast"
text_tokens = tokenizer([text])

# Extract 40+ features (see feature engineering guide)
features = extract_features(text)  # Your feature extraction function
features_tensor = torch.tensor(features).unsqueeze(0)

# Predict price
with torch.no_grad():
    predicted_price = model(image_tensor, text_tokens, features_tensor)
    print(f"Predicted Price: ${predicted_price.item():.2f}")

🏗️ Model Architecture

Overview

Product Image (512×512) ──┐
                          ├──> CLIP Vision (ViT-L/14) ──┐
Product Text ─────────────┼──> CLIP Text Transformer ───┤
                          │                              ├──> Feature Attention ──> Enhanced Head ──> Price
40+ Features ─────────────┘                              │     (Self-Attn + Gate)    (Dual-path +
(Quantities, Categories,                                 │                           Cross-Attn)
 Brands, Quality, etc.)                                  │

Key Components

Vision Encoder: CLIP ViT-L/14 (304M params, last 6 blocks trainable)
Text Encoder: CLIP Transformer (123M params, last 4 blocks trainable)
Feature Engineering: 40+ handcrafted features
Attention Fusion: Multi-head self-attention + gating mechanism
Price Head: Dual-path architecture with 8-head cross-attention + LoRA (r=48)

Trainable Parameters

Vision: 25.6M params (8.4% of vision encoder)
Text: 16.2M params (13.2% of text encoder)
Price Head: 4.2M params (LoRA fine-tuning)
Feature Gate: 0.8M params
Total Trainable: 78M / 395M (19.8%)

🔬 Feature Engineering (40+ Features)

1. Quantity Features (6)

Weight normalization (oz → standardized)
Volume normalization (ml → standardized)
Multi-pack detection
Unit per oz/ml ratios

2. Category Detection (6)

Food & Beverages
Electronics
Beauty & Personal Care
Home & Kitchen
Health & Supplements
Spices & Seasonings

3. Brand & Quality Indicators (7)

Brand score (capitalization analysis)
Premium keywords (17 indicators: "Premium", "Organic", "Artisan", etc.)
Budget keywords (7 indicators: "Value Pack", "Budget", etc.)
Special diet flags (vegan, gluten-free, kosher, halal)
Quality composite score

4. Bulk & Packaging (4)

Bulk detection
Single serve flag
Family size flag
Pack size analysis

5. Text Statistics (5)

Character/word counts
Bullet point extraction
Description richness
Catalog completeness

6. Price Signals (4)

Price tier indicators
Quality-adjusted signals
Category-quantity interactions

7. Unit Economics (5)

Weight/volume per count
Value per unit
Normalized quantities

8. Interaction Features (3+)

Brand × Premium
Category × Quantity
Multiple composite features

📈 Training Details

Dataset

Training: 75,000 Amazon products
Validation: 15,000 samples (20% split)
Format: Parquet (images as bytes + metadata)
Source: shawneil/hackathon

Hyperparameters

{
    "epochs": 3,
    "batch_size": 32,
    "gradient_accumulation": 2,
    "effective_batch_size": 64,
    "learning_rate": {
        "vision": 1e-6,
        "text": 1e-6,
        "head": 1e-4
    },
    "optimizer": "AdamW (betas=(0.9, 0.999), weight_decay=0.01)",
    "scheduler": "CosineAnnealingLR with warmup (500 steps)",
    "gradient_clip": 0.5,
    "mixed_precision": "fp16"
}

Loss Function (6 Components)

Total Loss = 0.05×Huber + 0.05×MSE + 0.65×SMAPE + 
             0.15×PercentageError + 0.05×WeightedMAE + 0.05×QuantileLoss

Where:
- SMAPE: Primary competition metric (65% weight)
- Percentage Error: Relative error focus (15%)
- Huber: Robust regression (δ=0.8)
- Weighted MAE: Price-aware weighting (1/price)
- Quantile: Median regression (τ=0.5)
- MSE: Standard regression baseline

Training Environment

Hardware: 2× NVIDIA T4 GPUs (16 GB each)
Time: ~54 minutes (3 epochs)
Memory: ~6.4 GB per GPU
Framework: PyTorch 2.0+, CUDA 11.8

🎯 Use Cases

E-commerce Applications

New Product Pricing: Predict optimal prices for new listings
Competitive Analysis: Benchmark against market prices
Dynamic Pricing: Automated price adjustments
Inventory Valuation: Estimate product worth

Business Intelligence

Market Research: Price trend analysis
Category Insights: Pricing patterns by category
Brand Positioning: Premium vs budget detection

📊 Performance by Category

Category	% of Data	SMAPE	MAE	Best Range
Food & Beverages	40%	34.8%	$5.12	$5-$25
Electronics	15%	39.1%	$8.94	$25-$100
Beauty	20%	35.6%	$4.87	$10-$50
Health	15%	37.3%	$6.24	$15-$40
Spices	5%	33.2%	$3.91	$5-$15
Other	5%	42.7%	$7.18	Varies

Best Performance: Low to mid-price items ($5-$50) covering 88% of products

🔍 Limitations & Bias

Known Limitations

High-price items: Lower accuracy for products >$100 (58.2% SMAPE)
Rare categories: Limited training data for niche products
Seasonal pricing: Doesn't account for time-based variations
Regional differences: Trained on US prices only

Potential Biases

Brand bias: May favor well-known brands
Category imbalance: Better on food/beauty vs electronics
Price range: Optimized for $5-$50 range

Recommendations

Use ensemble predictions for high-value items
Add category-specific post-processing
Combine with rule-based systems for edge cases
Monitor performance on new product categories

🛠️ Model Versions

Version	Date	SMAPE	Changes
v2.0	2025-01	36.5%	Enhanced features + architecture
v1.0	2025-01	45.8%	Baseline with 17 features
v0.1	2024-12	52.3%	CLIP-only (frozen)

📚 Citation

@misc{rodrigues2025amazon,
  title={Amazon Product Price Prediction using Multimodal Deep Learning},
  author={Rodrigues, Shawneil},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/shawneil/Amazon-ml-Challenge-Model}},
  note={SMAPE: 36.5\%}
}

📞 Resources

GitHub Repository: Amazon-ml-Challenge-Smape-score-36
Training Dataset: shawneil/hackathon
Test Dataset: shawneil/hackstest
Documentation: See GitHub repo for detailed guides

📄 License

MIT License - See LICENSE

🙏 Acknowledgments

OpenAI for CLIP pre-trained models
Hugging Face for hosting infrastructure
Amazon ML Challenge for dataset and competition

Built with ❤️ using PyTorch, CLIP, and smart feature engineering

From 52.3% to 36.5% SMAPE - Multimodal learning at its best

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for shawneil/Multi-Modal-Price-Predictor

Base model

openai/clip-vit-large-patch14

Adapter

(3)

this model

shawneil
/

Multi-Modal-Price-Predictor