The Office AI Science team is part of OPG. The team builds systems that are leveraged across M365 and especially within Word, Excel, and PowerPoint. The team’s recent projects have included: PPT Summarization, Audio Overviews (Podcast), SPOCK Eval, Data Pipeline, Natural Language to Office JS, and CUA.
PPT Summarization: The Office AI Science team built the first fine-tuned SLM within M365. The fine-tuned Phi-3 Vision SLM improved the p95 latency of the PPT Visual Summary feature from 13 seconds to 2 seconds while maintaining quality on par with GPT-4o-v. The optimization resulted in 75 times fewer GPUs being used compared to GPT-4o-v and almost 9 times as many PowerPoint users receiving a visual summary. The fine-tuned SLM also powers PPT Visual Q&A, making it both faster and cheaper. The team also introduced PPT Interactive Summary, which allows users to drill into visual summaries in more detail, leading to a decline of over 50% in thumbs-down ratings per 100k tries over 3 months, a 30% interactivity rate (users clicking the chevron to go deeper), and a 17.6% increase in weekly return rate. The team is currently fine-tuning 4o-mini-vision, with the goal of moving the remaining non-English traffic from GPT-4o-v to this smaller model, and is evaluating Phi-4 Vision for English.
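For readers less familiar with the p95 metric quoted above, here is a minimal sketch of how a 95th-percentile latency can be computed from raw request timings. It uses the nearest-rank method, which is an assumption; the team's telemetry may define the percentile differently. The function name `percentile_nearest_rank` and the sample latencies are hypothetical and for illustration only, not measured data.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest 1-based rank covering at least pct% of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in seconds (illustrative, not real telemetry).
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7,
             1.8, 1.9, 2.0, 2.1, 2.2, 2.4, 2.6, 3.0, 4.5, 9.0]

p50 = percentile_nearest_rank(latencies, 50)
p95 = percentile_nearest_rank(latencies, 95)
print(f"p50 = {p50} s, p95 = {p95} s")
```

A p95 figure tracks the slow tail rather than the typical request: in this made-up sample the median is 1.7 seconds while the p95 is 4.5 seconds, which is why a p95 improvement reflects a better experience for the slowest requests.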
Audio Overviews: The team is building the Audio Overview Skill, which introduces a podcast-like experience for consuming documents and artifacts. The feature is currently in the dogfood phase for MSIT, with production rollout scheduled to begin on May 7. Users will be able to generate Audio Overviews from App Chat entry points in Word Win32 & Web, Copilot Notebooks (including OneNote), and other apps such as Outlook Web, OneDrive Web, and ODSP Mobile. The latest human evaluation scores overall transcript quality for the single-file audio overview at 4.08/5.00, compared to 3.76/5.00 for NotebookLM. On automated evaluation, the team improved the overall score from an initial 4.09 to 4.65 with a two-step design leveraging GPT-4o and o3-mini. More details, including evaluation against multiple files for the Copilot Notebooks scenario and gains from moving to GPT-4.1, can be found here.
SPOCK (AugLoop Eval): In collaboration with AugLoop, the Office AI Science team developed several key features that enable agility in evaluating App Copilot scenario quality metrics. By the end of FY25Q3, 22 scenarios had been onboarded across Word, PPT, Office AI, and SharePoint, with Excel onboarding in progress. The platform now reliably runs 300 eval jobs and 30,000 tests daily. Turnaround time for an automated scenario evaluation has dropped from days for a manual run to 2-4 hours. SPOCK now supports intent detection, Leo Metrics, BizChat 1K Query, and Python and TypeScript customer evaluators; model swap and FlexV3 eval are coming in Q4. Additionally, the v-team is automating the App Copilot Quality Dashboard (ÆVAL – Copilot Evaluation), which provides a comprehensive overview of the quality of App Copilot scenarios.
Data Pipeline: The team also created an online, self-serve, on-demand ADF pipeline for mining Office documents from the internet. The pipeline lets partners kick off large-scale data mining jobs for specific languages and document types, and it includes custom metadata extractors for extracting task-dependent document representations. Because it leverages Bing’s precrawled 40B URL RetroIndex, document discovery is fast and efficient. OAI Science and several partner teams (Word+Editor, PPT Science, Word Designer, Designer, MSAI) are already using the data for fine-tuning and test set creation.
Natural Language to Office JS: The Office AI Science team is working to fine-tune an o* family model for common Office scenarios such as inserting slides from another PowerPoint file, inserting headers and footers in Word, or creating and finding merged ranges in Excel.
CUA: The team also recently embarked on an exploration of Computer User Agent (CUA) centered on understanding user intent and adapting in real time. Leveraging plan assistance with the Office knowledge base, the team approximately doubled the task completion rate against OSWorld PPT scenarios. The team is working on fine-tuning the CUA model to improve task completions for Office apps.
For more information, contact Amanda Gunnemo or Vishal Chowdhary.