Manas Pathak

Experience

Software Developer Intern · IBM

Summer 2026 · New York, NY

Building agent infrastructure for LangGraph Deep Agents on OpenShift, enabling skill-based orchestration across AI workflows.

Machine Learning Researcher · HUMAIN Lab, UT Austin

Mar 2025 — present

LLM evaluation and reliability with Prof. Leqi Liu. Built CUDA-parallel PyTorch and vLLM pipelines that cut evaluation runtime by 70% across 100+ models, a distributed GPU job scheduler on AWS that saved ~$10k in cloud costs, chain-of-thought hallucination detection for Qwen/Gemma-7B, and React/FastAPI dashboards tracking 50+ distributed jobs.

Undergraduate Department Tutor · ECE 312, UT Austin

Jan 2026 — May 2026

Mentoring 40+ students in C/C++ memory management, recursion, and graph traversal, with GDB and Valgrind debugging sessions.

Software Engineering Intern · Graph Neural Networks Lab, UT Austin

Dec 2024 — Mar 2025

Pipeline orchestration and monitoring with TypeScript, React, and PostgreSQL, plus vectorized geometric tooling and performance profiling in Python.

Software Engineering Intern · ModHeader

Jun 2023 — Oct 2023

Full-stack work for a 250k-user browser extension: Svelte UI, Node.js APIs on DynamoDB, and an analytics dashboard in React and TypeScript.

Research

arXiv:2604.11996 · Under review, COLM 2026 · First author

Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces

Manas Pathak, Xingyao Chen, Amy Zhang, Leqi Liu

Accuracy treats every answer the same, but deployed systems act on the outputs a model is most confident about. FRS evaluates reasoning quality on exactly those traces, conditioning the metric on the model's own confidence. The result is a view of reliability that accuracy alone cannot provide: two models with identical scores can behave very differently when you only trust what they are sure of, and FRS makes that difference measurable.

Projects

CyberWise Seniors

Fraud-detection platform shielding seniors from phone, SMS, and email scams. A Spring Boot microservice wired to a BERT classifier returns risk scores in <200 ms; Twilio webhooks and a React dashboard cut successful scams by 60%.

Java · Spring Boot · BERT · Twilio · React

Order Smarter

A Spring Boot orchestrator that scrapes Uber Eats group-cart links, predicts calories with a Python NLP model, and returns structured nutritional insights through a REST API.

Java · Spring Boot · Python · NLP · REST

WanderWise

Full-stack travel planner for building and sharing itineraries. JWT-secured CRUD over Django REST and PostgreSQL, OAuth2 APIs with rate limiting and Swagger docs, and a React + Tailwind front end on the Google Maps API.

React · Django REST · PostgreSQL · Google Maps

Mavs Draft Tool

A scouting tool built for the Dallas Mavericks to filter and compare prospects for the 2025 NBA draft, turning raw player data into rankings the front office can act on.

TypeScript · React · Data

Education

The University of Texas at Austin

Expected May 2028

B.S. Electrical & Computer Engineering, Software Concentration · GPA 3.96 / 4.00

Relevant coursework

Distributed Systems · Deep Reinforcement Learning · Operating Systems · Machine Learning · Computer Architecture · Data Structures & Algorithms · Discrete Math · Software Design & Implementation I/II

About

I work on AI systems, with a focus on agents. As agents take on longer horizons and more autonomy, the hard problem shifts from making them capable to making them legible: tracing what they did, why they did it, and whether it worked.

That is the thread through my work. My research builds evaluation methods that condition on a model's own confidence, so a score reflects behavior you would actually act on. My engineering builds the observability layer for agent workflows, instrumenting how agents select and compose skills. Capability claims should be verifiable. That is how we trust progress.

Degree: B.S. ECE, May 2028
GPA: 3.96 / 4.00
Honors: Undergraduate Research Fellowship ($10,000) · Tau Beta Pi · Eta Kappa Nu
Stack: Python, C/C++, Java, Rust, Go, TypeScript · PyTorch, CUDA, JAX · Kubernetes, AWS, Terraform