Deep Dive | “LLMs Can See and Hear Without Any Training”? My Skeptical Take on the MILS Paper


TL;DR

Meta proposed MILS (Multimodal Iterative LLM Solver), claiming that large language models (LLMs) can directly handle images, videos, and audio tasks without any multimodal training.

My conclusion:

  • MILS is a highly creative inference optimization technique, suitable for showcasing LLMs’ reasoning potential, but it does not mean that LLMs have truly acquired perceptual abilities.
  • The entire process relies entirely on external pre-trained multimodal scorers (such as CLIP, SigLIP, etc.); the LLM itself does not actually “see” or “hear” the media input.
  • The success of MILS depends on black-box optimization through repeated guessing and feedback scoring, rather than any true perceptual understanding by the LLM.
  • Although it avoids the cost of training a new multimodal model, each inference consumes far more computational resources than a traditional multimodal model would.
  • I simulated the MILS method using a simple number-guessing experiment and confirmed: even without any perceptual ability, it is possible to progressively approach the correct answer purely through massive random guessing and feedback — but this is brute-force reasoning, not real understanding.

1. Quick Overview: How MILS Actually Works

Input: A test image (or video/audio clip).

  • Initialization: Load 30,000 candidate descriptions. For each image, compute similarity scores between all 30K descriptions and the image using CLIP (or SigLIP). Select the top 50 high-scoring descriptions as the initial pool.
  • GENERATOR: Based on the current pool, the LLM generates a batch of new candidate descriptions or instructions (50 candidates per iteration in the paper).
  • SCORER: Models such as SigLIP, ViCLIP, or ImageBind compute similarity scores between each text candidate and the media input.
  • Feedback Loop: Feed the scores and the top candidates back into the LLM to guide it in generating better descriptions.
  • Repeat: Iterate N times (10 iterations in the paper) and keep the highest-scoring description at the end (see the code sketch after this list).
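
To make the loop concrete, here is a minimal Python sketch of the procedure described above. `clip_score` and `llm_generate` are placeholder functions I introduce for illustration; they stand in for the external scorer (CLIP/SigLIP/ImageBind) and the LLM generator, and this is not the paper's actual code.

```python
# Minimal sketch of the MILS-style loop described above.
# `clip_score` and `llm_generate` are hypothetical placeholders for the external
# scorer and the LLM generator; the paper's real implementation differs in details.

def mils_caption(image, candidate_bank, clip_score, llm_generate,
                 pool_size=50, n_rounds=10):
    # Initialization: score the ~30K description bank against the image,
    # keep the top `pool_size` as the starting pool.
    scores = clip_score(image, candidate_bank)
    pool = sorted(zip(candidate_bank, scores), key=lambda p: p[1], reverse=True)[:pool_size]

    for _ in range(n_rounds):
        # GENERATOR: the LLM proposes new descriptions conditioned on the scored pool.
        new_texts = llm_generate(pool, n=pool_size)
        # SCORER: the external multimodal model grades each candidate against the image.
        new_scores = clip_score(image, new_texts)
        # FEEDBACK: merge old and new candidates, keep only the highest-scoring ones.
        pool = sorted(pool + list(zip(new_texts, new_scores)),
                      key=lambda p: p[1], reverse=True)[:pool_size]

    # Return the single best description found across all rounds.
    return max(pool, key=lambda p: p[1])[0]
```

Note that every bit of grounding in the media input flows through `clip_score`; the LLM only ever sees text and numbers.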

My understanding:
In simple terms, this method merely uses multiple rounds of guessing, guided by score feedback, to steer the LLM’s outputs toward the correct answer.
It’s similar to how a blindfolded person might eventually guess what’s in a picture through tens of thousands of trials based purely on feedback — it’s brute-force optimization powered by massive computation, not true perception.


2. Why I Remain Skeptical of the Title Claim

2.1 The LLM Never Actually Sees or Hears Anything

The entire “perception” comes from the external scorer.
Remove SigLIP/ImageBind, and the LLM would still be completely blind and deaf.

2.2 Performance Still Relies Heavily on Pretraining

Even though the LLM itself isn’t further fine-tuned, the entire process is driven by a black-box optimization: “scoring → rewriting → re-scoring.”
The real burden of perception and semantic evaluation is still carried by heavily pre-trained multimodal scorers.

2.3 Hidden Massive Computational Costs

For example, in the MILS setup:

  • The 30K-description bank must first be scored against the input with CLIP.
  • Then the loop needs 50 candidates × 10 rounds = 500 additional LLM generations, each of which must also be re-scored.

Even though batch processing is possible, there remains a significant hidden computational cost per input, as the quick count below shows.
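
A rough back-of-the-envelope count, using the numbers above and my own assumption of one scorer evaluation per candidate, makes the per-image cost explicit:

```python
# Per-image call counts implied by the setup above
# (assumes one scorer evaluation per candidate text).
bank_size = 30_000                               # initial descriptions scored by CLIP/SigLIP
rounds, cands_per_round = 10, 50

initial_scorer_calls = bank_size                 # 30,000
loop_llm_generations = rounds * cands_per_round  # 500
loop_scorer_calls    = rounds * cands_per_round  # 500

print(initial_scorer_calls + loop_scorer_calls)  # 30,500 scorer evaluations per image
print(loop_llm_generations)                      # 500 LLM generations per image
```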

2.4 Risk of Overfitting to the Scorer

Since optimization is based solely on a single feedback score, the LLM can easily overfit to the scoring model’s specific preferences (e.g., overemphasis on color words) without truly understanding the input content.
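
As a toy illustration of this failure mode (my own construction, not the paper's experiment or CLIP's real behavior), suppose the scorer were a crude bag-of-words model that rewards color terms; a caption stuffed with color words would then outrank a faithful one:

```python
# Toy example of overfitting to a flawed scorer: the proxy rewards color words,
# so a keyword-stuffed caption "wins" over an accurate description.
COLOR_WORDS = {"red", "blue", "green", "yellow", "orange", "purple"}

def proxy_score(caption):
    # Stand-in for a biased scorer: +1 per color word, ignoring everything else.
    return sum(word in COLOR_WORDS for word in caption.lower().split())

faithful = "a dog sleeping on a wooden porch"
gamed    = "red blue green yellow orange purple dog"

print(proxy_score(faithful), proxy_score(gamed))  # 0 vs 6: the gamed caption wins
```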



Conclusion

Through a brute-force optimization strategy powered by massive compute, MILS showcases one possible “black box” face of deep learning.
A blind person cannot actually see an image, yet through endless trials and feedback could eventually guess its contents and memorize whichever answer scores best; that does not mean they can suddenly see.

Thus, LLMs do not actually grow eyes and ears through MILS;
they simply borrow the capabilities of well-trained multimodal scorers, at the cost of significant computational overhead.


Additional Validation: A Simple Number Guessing Experiment

To further validate this idea, I designed a simple number-guessing experiment that mimics the MILS paper’s method:
(You can check out the full simulation in Google Colab.)

Simulation Steps (a runnable sketch follows this list):

  • Randomly generate 30,000 initial guesses.
  • Select the top 50 closest guesses based on distance to the true answer.
  • Set the new min/max range based on these 50 guesses.
  • In each iteration:
    • Randomly generate 50 new guesses within the current range.
    • Pick the best new guess and update the range.
  • After 10 rounds, output the final best guess and its percent error compared to the true answer.
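
For reference, here is a minimal standalone version of that simulation. The hidden target value and the exact range-update rule (I shrink the window around each round's top candidates) are my own choices and may differ slightly from the Colab notebook.

```python
import random

TRUE_ANSWER = 73_418.0        # hidden target; stands in for the "image" (arbitrary choice)
LOW, HIGH = 0.0, 100_000.0    # initial search range

def score(guess):
    # Feedback signal: negative distance to the target (higher is better).
    # This plays the role of the external scorer (CLIP/SigLIP) in MILS.
    return -abs(guess - TRUE_ANSWER)

# Step 1: 30,000 random initial guesses; keep the 50 closest as the starting pool.
initial = [random.uniform(LOW, HIGH) for _ in range(30_000)]
pool = sorted(initial, key=score, reverse=True)[:50]
lo, hi = min(pool), max(pool)
best = pool[0]

# Steps 2-3: 10 rounds of 50 fresh guesses inside the current range.
for r in range(1, 11):
    candidates = [random.uniform(lo, hi) for _ in range(50)]
    round_best = max(candidates, key=score)
    if score(round_best) > score(best):
        best = round_best
    # Update the range: shrink the window around this round's top candidates.
    top = sorted(candidates, key=score, reverse=True)[:5]
    lo, hi = min(top), max(top)
    print(f"round {r:2d}: best guess so far = {best:,.1f}")

error_pct = abs(best - TRUE_ANSWER) / TRUE_ANSWER * 100
print(f"final guess = {best:,.1f}, percent error = {error_pct:.4f}%")
```

On a typical run the final error ends up at a tiny fraction of a percent, even though nothing in the loop ever "knows" the number; it only compares feedback scores.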

What This Experiment Shows:

  • Even without any “vision” or “hearing,” pure feedback-driven guessing can progressively converge towards the answer.
  • The final result is merely the outcome of massive trial-and-error and filtering, not actual “understanding” of the input.
  • Therefore, the success of MILS does not mean LLMs have acquired real perceptual abilities without training;
    it simply demonstrates brute-force optimization assisted by external models.
