{"id":361,"date":"2025-04-26T10:10:19","date_gmt":"2025-04-26T01:10:19","guid":{"rendered":"https:\/\/appfreelife.com\/?p=361"},"modified":"2025-04-26T10:13:44","modified_gmt":"2025-04-26T01:13:44","slug":"deep-dive-llms-can-see-and-hear-without-any-training-my-skeptical-take-on-the-mils-paper","status":"publish","type":"post","link":"https:\/\/appfreelife.com\/?p=361","title":{"rendered":"Deep Dive | \u201cLLMs Can See and Hear Without Any Training\u201d?\u00a0 My Skeptical Take on the MILS Paper"},"content":{"rendered":"\n<h2 class=\"wp-block-heading jinr-heading d--bold\"><strong>TL;DR<\/strong><\/h2>\n\n\n\n<p>Meta proposed <strong>MILS<\/strong> (Multimodal Iterative LLM Solver), claiming that large language models (LLMs) can directly handle image, video, and audio tasks without any multimodal training.<\/p>\n\n\n\n<p>My conclusion:<\/p>\n\n\n\n<ul class=\"wp-block-list jinr-list\">\n<li><strong>MILS is a highly creative inference-time optimization technique that showcases LLMs&#8217; reasoning potential, but it does not mean that LLMs have truly acquired perceptual abilities.<\/strong><\/li>\n\n\n\n<li>The process <strong>relies entirely on external pre-trained multimodal scorers<\/strong> (such as CLIP or SigLIP); the LLM itself never actually &#8220;sees&#8221; or &#8220;hears&#8221; the media input.<\/li>\n\n\n\n<li>The success of MILS depends on <strong>black-box optimization through repeated guessing and feedback scoring<\/strong>, not on any true perceptual understanding by the LLM.<\/li>\n\n\n\n<li>Although it avoids the cost of training a new model, <strong>each inference run consumes far more computational resources<\/strong> than a traditional multimodal model would.<\/li>\n\n\n\n<li>I simulated the MILS method with a simple number-guessing experiment and confirmed that <strong>even without any perceptual ability, it is possible to progressively approach the correct answer purely through massive random guessing and 
feedback, but this is brute-force search, not real understanding.<\/strong><\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading jinr-heading d--bold\"><strong>1. Quick Overview: How MILS Actually Works<\/strong><\/h2>\n\n\n\n<p><strong>Input:<\/strong> A test image (or video\/audio clip).<\/p>\n\n\n\n<ul class=\"wp-block-list jinr-list\">\n<li><strong>Initialization:<\/strong> Load 30,000 candidate descriptions. For each image, compute similarity scores between all 30K descriptions and the image using CLIP (or SigLIP). Select the 50 highest-scoring descriptions as the initial pool.<\/li>\n\n\n\n<li><strong>Generator (GENERATOR):<\/strong> Based on the pool, the LLM generates a batch of candidate descriptions or instructions (50 candidates per iteration in the paper).<\/li>\n\n\n\n<li><strong>Scorer (SCORER):<\/strong> A pre-trained model such as SigLIP, ViCLIP, or ImageBind computes a similarity score between each text candidate and the media input.<\/li>\n\n\n\n<li><strong>Feedback Loop:<\/strong> Feed the scores and the top candidates back into the LLM to guide it toward better descriptions.<\/li>\n\n\n\n<li><strong>Repeat:<\/strong> Iterate N times (10 iterations in the paper) and keep the highest-scoring description at the end.<\/li>\n<\/ul>\n\n\n\n<p><strong>My understanding:<\/strong><br>In simple terms, this method uses multiple rounds of guessing, guided by score feedback, to steer the LLM\u2019s outputs toward the correct answer.<br>It\u2019s similar to how a blindfolded person might eventually guess what&#8217;s in a picture through tens of thousands of trials based purely on feedback: <strong>it&#8217;s brute-force optimization powered by massive computation, not true perception<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading jinr-heading 
d--bold\"><strong>2. Why I Remain Skeptical of the Title Claim<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading jinr-heading d--bold\"><strong>2.1 The LLM Never Actually Sees or Hears Anything<\/strong><\/h3>\n\n\n\n<p>All of the &#8220;perception&#8221; comes from the external scorer.<br>Remove SigLIP\/ImageBind, and the LLM is left completely blind and deaf.<\/p>\n\n\n\n<h3 class=\"wp-block-heading jinr-heading d--bold\"><strong>2.2 The Performance Relies Heavily on Pretraining<\/strong><\/h3>\n\n\n\n<p>The LLM itself is not fine-tuned, but the whole process is a black-box optimization loop: &#8220;scoring \u2192 rewriting \u2192 re-scoring.&#8221;<br>The real burden of perception and semantic evaluation is still carried by <strong>heavily pre-trained multimodal scorers<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading jinr-heading d--bold\"><strong>2.3 Hidden Massive Computational Costs<\/strong><\/h3>\n\n\n\n<p>For example, in the MILS setup:<\/p>\n\n\n\n<ul class=\"wp-block-list jinr-list\">\n<li>All 30K initial descriptions must first be scored with CLIP.<\/li>\n\n\n\n<li>After that, 50 candidates \u00d7 10 rounds = 500 further candidates must be generated and scored.<\/li>\n<\/ul>\n\n\n\n<p>Batch processing helps, but a <strong>hidden and significant computational cost<\/strong> remains.<\/p>\n\n\n\n<h3 class=\"wp-block-heading jinr-heading d--bold\"><strong>2.4 Risk of Overfitting to the Scorer<\/strong><\/h3>\n\n\n\n<p>Because optimization is driven by a single feedback score, the LLM can easily overfit to the scoring model\u2019s specific preferences (e.g., overemphasis on color words) without truly understanding the input content.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading jinr-heading d--bold\"><strong>3. 
Further Reading and Resources<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list jinr-list\">\n<li>Original Paper: <a href=\"https:\/\/arxiv.org\/abs\/2501.18096\">LLMs Can See and Hear Without Any Training (arXiv:2501.18096)<\/a><\/li>\n\n\n\n<li>Official Code Repository: <a href=\"https:\/\/github.com\/facebookresearch\/MILS\">MILS GitHub Repo<\/a><\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading jinr-heading d--bold\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>Through a brute-force optimization strategy powered by massive compute, MILS shows one &#8220;black box&#8221; side of deep learning.<br>A blind person cannot see an image, but could still guess its contents through endless trials and feedback and then memorize the answer; this <strong>does not mean the blind person can suddenly see<\/strong>.<\/p>\n\n\n\n<p>Thus, LLMs do not actually grow eyes and ears through MILS;<br>they simply <strong>borrow the capabilities of well-trained multimodal scorers<\/strong>, at the cost of significant computational overhead.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading jinr-heading d--bold\"><strong>Additional Validation: A Simple Number-Guessing Experiment<\/strong><\/h2>\n\n\n\n<p>To further validate this idea, I designed a simple number-guessing experiment that mimics the MILS paper\u2019s method.<br>(You can check out the full simulation in <a href=\"https:\/\/colab.research.google.com\/drive\/130tann2e7gqu9H1SvlApY-he1ISPuTCE?usp=sharing\">Google Colab<\/a>.)<\/p>\n\n\n\n<h3 class=\"wp-block-heading jinr-heading d--bold\"><strong>Simulation Steps:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list jinr-list\">\n<li>Randomly generate 30,000 initial guesses.<\/li>\n\n\n\n<li>Select the 50 guesses closest to the true answer.<\/li>\n\n\n\n<li>Set the new min\/max 
range based on these 50 guesses.<\/li>\n\n\n\n<li>In each iteration:\n<ul class=\"wp-block-list jinr-list\">\n<li>Randomly generate 50 new guesses within the current range.<\/li>\n\n\n\n<li>Pick the best new guess and update the range.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>After 10 rounds, output the final best guess and its percentage error relative to the true answer.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading jinr-heading d--bold\"><strong>What This Experiment Shows:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list jinr-list\">\n<li>Even without <strong>any &#8220;vision&#8221; or &#8220;hearing,&#8221;<\/strong> pure feedback-driven guessing can progressively converge on the answer.<\/li>\n\n\n\n<li>The final result is merely the outcome of <strong>massive trial-and-error and filtering<\/strong>, not actual &#8220;understanding&#8221; of the input.<\/li>\n\n\n\n<li>Therefore, the success of MILS does <strong>not<\/strong> mean LLMs have acquired real perceptual abilities without training;<br>it simply demonstrates <strong>brute-force optimization assisted by an external model<\/strong>.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>TL;DR Meta proposed MILS (Multimodal Iterative LLM Solver), claiming that large language models (LLMs) can dir 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":362,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jinr_url_youtube":"","_jinr_pip_youtube":false,"_jinr_time_youtube":"","_jinr_thumb_youtube":"","_jinr_media_youtube":"","_jinr_category_edit":false,"_jinr_category":"","_jinr_title_display":false,"_jinr_snsbutton_display":false,"_jinr_ads_display":false,"_jinr_thumbnail_display":false,"_jinr_profile_display":false,"_jinr_representations_display":false,"_jinr_relatedpost_display":false,"_jinr_sidebar1col_display":false,"_jinr_sidebar2col_display":false,"_jinr_seotitle_display":"","_jinr_description_display":"","_jinr_keyword_display":"","_jinr_hastag_display":"","_jinr_canonical_display":"","_jinr_noindex_display":false,"_jinr_paidpost":false,"_jinr_paidpost_product_id":"","_jinr_headtag_article":"","footnotes":""},"categories":[4],"tags":[],"class_list":["post-361","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-appintro"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.9 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Deep Dive | \u201cLLMs Can See and Hear Without Any Training\u201d?\u00a0 My Skeptical Take on the MILS Paper - \u30a2\u30d7\u30ea\u526f\u696d\u30e9\u30dc\uff01<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/appfreelife.com\/?p=361\" \/>\n<meta property=\"og:locale\" content=\"ja_JP\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Deep Dive | \u201cLLMs Can See and Hear Without Any Training\u201d?\u00a0 My Skeptical Take on the MILS Paper - \u30a2\u30d7\u30ea\u526f\u696d\u30e9\u30dc\uff01\" \/>\n<meta property=\"og:description\" content=\"TL;DR Meta proposed MILS (Multimodal Iterative LLM Solver), claiming that large language models 
(LLMs) can dir [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/appfreelife.com\/?p=361\" \/>\n<meta property=\"og:site_name\" content=\"\u30a2\u30d7\u30ea\u526f\u696d\u30e9\u30dc\uff01\" \/>\n<meta property=\"article:published_time\" content=\"2025-04-26T01:10:19+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-26T01:13:44+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/The-AI-Who-Couldnt-See-600-x-700-px.png\" \/>\n\t<meta property=\"og:image:width\" content=\"700\" \/>\n\t<meta property=\"og:image:height\" content=\"700\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"info@appfreelife.com\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u57f7\u7b46\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"info@appfreelife.com\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u63a8\u5b9a\u8aad\u307f\u53d6\u308a\u6642\u9593\" \/>\n\t<meta name=\"twitter:data2\" content=\"13\u5206\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/appfreelife.com\/?p=361#article\",\"isPartOf\":{\"@id\":\"https:\/\/appfreelife.com\/?p=361\"},\"author\":{\"name\":\"info@appfreelife.com\",\"@id\":\"https:\/\/appfreelife.com\/#\/schema\/person\/642f91ef444c469d236f2a18aeec68d7\"},\"headline\":\"Deep Dive | \u201cLLMs Can See and Hear Without Any Training\u201d?\u00a0 My Skeptical Take on the MILS 
Paper\",\"datePublished\":\"2025-04-26T01:10:19+00:00\",\"dateModified\":\"2025-04-26T01:13:44+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/appfreelife.com\/?p=361\"},\"wordCount\":800,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/appfreelife.com\/#\/schema\/person\/642f91ef444c469d236f2a18aeec68d7\"},\"image\":{\"@id\":\"https:\/\/appfreelife.com\/?p=361#primaryimage\"},\"thumbnailUrl\":\"https:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/The-AI-Who-Couldnt-See-600-x-700-px.png\",\"articleSection\":[\"\u30a2\u30d7\u30ea\u7d39\u4ecb\u306b\u3064\u3044\u3066\"],\"inLanguage\":\"ja\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/appfreelife.com\/?p=361#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/appfreelife.com\/?p=361\",\"url\":\"https:\/\/appfreelife.com\/?p=361\",\"name\":\"Deep Dive | \u201cLLMs Can See and Hear Without Any Training\u201d?\u00a0 My Skeptical Take on the MILS Paper - \u30a2\u30d7\u30ea\u526f\u696d\u30e9\u30dc\uff01\",\"isPartOf\":{\"@id\":\"https:\/\/appfreelife.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/appfreelife.com\/?p=361#primaryimage\"},\"image\":{\"@id\":\"https:\/\/appfreelife.com\/?p=361#primaryimage\"},\"thumbnailUrl\":\"https:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/The-AI-Who-Couldnt-See-600-x-700-px.png\",\"datePublished\":\"2025-04-26T01:10:19+00:00\",\"dateModified\":\"2025-04-26T01:13:44+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/appfreelife.com\/?p=361#breadcrumb\"},\"inLanguage\":\"ja\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/appfreelife.com\/?p=361\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\/\/appfreelife.com\/?p=361#primaryimage\",\"url\":\"https:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/The-AI-Who-Couldnt-See-600-x-700-px.png\",\"contentUrl\":\"https:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/The-AI-Who-Couldnt-See-600-x-700-px.png\",\"widt
h\":700,\"height\":700},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/appfreelife.com\/?p=361#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\u30db\u30fc\u30e0\",\"item\":\"https:\/\/appfreelife.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Deep Dive | \u201cLLMs Can See and Hear Without Any Training\u201d?\u00a0 My Skeptical Take on the MILS Paper\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/appfreelife.com\/#website\",\"url\":\"https:\/\/appfreelife.com\/\",\"name\":\"\u30a2\u30d7\u30ea\u526f\u696d\u30e9\u30dc\uff01\",\"description\":\"AI\u30a2\u30d7\u30ea\u958b\u767a\u3067\u3001\u81ea\u7531\u306a\u50cd\u304d\u65b9\u3092\u3064\u304b\u3082\u3046\u3002\",\"publisher\":{\"@id\":\"https:\/\/appfreelife.com\/#\/schema\/person\/642f91ef444c469d236f2a18aeec68d7\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/appfreelife.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ja\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/appfreelife.com\/#\/schema\/person\/642f91ef444c469d236f2a18aeec68d7\",\"name\":\"info@appfreelife.com\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\/\/appfreelife.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/cropped-App\u526f\u696d\u30e9\u30dcicon2.png\",\"contentUrl\":\"https:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/cropped-App\u526f\u696d\u30e9\u30dcicon2.png\",\"width\":512,\"height\":512,\"caption\":\"info@appfreelife.com\"},\"logo\":{\"@id\":\"https:\/\/appfreelife.com\/#\/schema\/person\/image\/\"},\"sameAs\":[\"http:\/\/appfreelife.com\"],\"url\":\"https:\/\/appfreelife.com\/?author=1\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Deep Dive | \u201cLLMs Can See and Hear Without Any Training\u201d?\u00a0 My Skeptical Take on the MILS Paper - \u30a2\u30d7\u30ea\u526f\u696d\u30e9\u30dc\uff01","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/appfreelife.com\/?p=361","og_locale":"ja_JP","og_type":"article","og_title":"Deep Dive | \u201cLLMs Can See and Hear Without Any Training\u201d?\u00a0 My Skeptical Take on the MILS Paper - \u30a2\u30d7\u30ea\u526f\u696d\u30e9\u30dc\uff01","og_description":"TL;DR Meta proposed MILS (Multimodal Iterative LLM Solver), claiming that large language models (LLMs) can dir [&hellip;]","og_url":"https:\/\/appfreelife.com\/?p=361","og_site_name":"\u30a2\u30d7\u30ea\u526f\u696d\u30e9\u30dc\uff01","article_published_time":"2025-04-26T01:10:19+00:00","article_modified_time":"2025-04-26T01:13:44+00:00","og_image":[{"width":700,"height":700,"url":"http:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/The-AI-Who-Couldnt-See-600-x-700-px.png","type":"image\/png"}],"author":"info@appfreelife.com","twitter_card":"summary_large_image","twitter_misc":{"\u57f7\u7b46\u8005":"info@appfreelife.com","\u63a8\u5b9a\u8aad\u307f\u53d6\u308a\u6642\u9593":"13\u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/appfreelife.com\/?p=361#article","isPartOf":{"@id":"https:\/\/appfreelife.com\/?p=361"},"author":{"name":"info@appfreelife.com","@id":"https:\/\/appfreelife.com\/#\/schema\/person\/642f91ef444c469d236f2a18aeec68d7"},"headline":"Deep Dive | \u201cLLMs Can See and Hear Without Any Training\u201d?\u00a0 My Skeptical Take on the MILS 
Paper","datePublished":"2025-04-26T01:10:19+00:00","dateModified":"2025-04-26T01:13:44+00:00","mainEntityOfPage":{"@id":"https:\/\/appfreelife.com\/?p=361"},"wordCount":800,"commentCount":0,"publisher":{"@id":"https:\/\/appfreelife.com\/#\/schema\/person\/642f91ef444c469d236f2a18aeec68d7"},"image":{"@id":"https:\/\/appfreelife.com\/?p=361#primaryimage"},"thumbnailUrl":"https:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/The-AI-Who-Couldnt-See-600-x-700-px.png","articleSection":["\u30a2\u30d7\u30ea\u7d39\u4ecb\u306b\u3064\u3044\u3066"],"inLanguage":"ja","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/appfreelife.com\/?p=361#respond"]}]},{"@type":"WebPage","@id":"https:\/\/appfreelife.com\/?p=361","url":"https:\/\/appfreelife.com\/?p=361","name":"Deep Dive | \u201cLLMs Can See and Hear Without Any Training\u201d?\u00a0 My Skeptical Take on the MILS Paper - \u30a2\u30d7\u30ea\u526f\u696d\u30e9\u30dc\uff01","isPartOf":{"@id":"https:\/\/appfreelife.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/appfreelife.com\/?p=361#primaryimage"},"image":{"@id":"https:\/\/appfreelife.com\/?p=361#primaryimage"},"thumbnailUrl":"https:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/The-AI-Who-Couldnt-See-600-x-700-px.png","datePublished":"2025-04-26T01:10:19+00:00","dateModified":"2025-04-26T01:13:44+00:00","breadcrumb":{"@id":"https:\/\/appfreelife.com\/?p=361#breadcrumb"},"inLanguage":"ja","potentialAction":[{"@type":"ReadAction","target":["https:\/\/appfreelife.com\/?p=361"]}]},{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/appfreelife.com\/?p=361#primaryimage","url":"https:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/The-AI-Who-Couldnt-See-600-x-700-px.png","contentUrl":"https:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/The-AI-Who-Couldnt-See-600-x-700-px.png","width":700,"height":700},{"@type":"BreadcrumbList","@id":"https:\/\/appfreelife.com\/?p=361#breadcrumb","itemListElement":[{"@type":"ListItem","posi
tion":1,"name":"\u30db\u30fc\u30e0","item":"https:\/\/appfreelife.com\/"},{"@type":"ListItem","position":2,"name":"Deep Dive | \u201cLLMs Can See and Hear Without Any Training\u201d?\u00a0 My Skeptical Take on the MILS Paper"}]},{"@type":"WebSite","@id":"https:\/\/appfreelife.com\/#website","url":"https:\/\/appfreelife.com\/","name":"\u30a2\u30d7\u30ea\u526f\u696d\u30e9\u30dc\uff01","description":"AI\u30a2\u30d7\u30ea\u958b\u767a\u3067\u3001\u81ea\u7531\u306a\u50cd\u304d\u65b9\u3092\u3064\u304b\u3082\u3046\u3002","publisher":{"@id":"https:\/\/appfreelife.com\/#\/schema\/person\/642f91ef444c469d236f2a18aeec68d7"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/appfreelife.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ja"},{"@type":["Person","Organization"],"@id":"https:\/\/appfreelife.com\/#\/schema\/person\/642f91ef444c469d236f2a18aeec68d7","name":"info@appfreelife.com","image":{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/appfreelife.com\/#\/schema\/person\/image\/","url":"https:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/cropped-App\u526f\u696d\u30e9\u30dcicon2.png","contentUrl":"https:\/\/appfreelife.com\/wp-content\/uploads\/2025\/04\/cropped-App\u526f\u696d\u30e9\u30dcicon2.png","width":512,"height":512,"caption":"info@appfreelife.com"},"logo":{"@id":"https:\/\/appfreelife.com\/#\/schema\/person\/image\/"},"sameAs":["http:\/\/appfreelife.com"],"url":"https:\/\/appfreelife.com\/?author=1"}]}},"views":"3","_links":{"self":[{"href":"https:\/\/appfreelife.com\/index.php?rest_route=\/wp\/v2\/posts\/361","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/appfreelife.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/appfreelife.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/appfreelife.com\/index.php?res
t_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/appfreelife.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=361"}],"version-history":[{"count":5,"href":"https:\/\/appfreelife.com\/index.php?rest_route=\/wp\/v2\/posts\/361\/revisions"}],"predecessor-version":[{"id":369,"href":"https:\/\/appfreelife.com\/index.php?rest_route=\/wp\/v2\/posts\/361\/revisions\/369"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/appfreelife.com\/index.php?rest_route=\/wp\/v2\/media\/362"}],"wp:attachment":[{"href":"https:\/\/appfreelife.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=361"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/appfreelife.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=361"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/appfreelife.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=361"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}