
    AI for User Research @ BenchSci: Tactics that worked.

    TLDR: For our use, maybe 5% more speed with acceptable quality. Biggest benefit: coding for spreadsheets / data analysis.

    Context

    Until recently I ran user research at * (check my LinkedIn) and mentored designers and product managers. We put heavy emphasis on effective decision support for product strategy (building the right things) in addition to the usual kinds of design/PM support. I’ll try to limit this to user research rather than general productivity. Doctrinally, we followed the principles of continuous discovery and validation with a small user base.

    Rules of the road for the Research team

    AI helps if it improves some balance of speed and quality while respecting these rules:

    • If you bring data to stakeholders too late, it’s not helpful.
    • If the data you bring is misleading, it’s not helpful.
    • Numbers speak powerfully, but avoid misleading quantification.
    • Avoid jargon: always assume you are speaking to a design/product outsider with no interest in the finer points of confidence intervals and statistical significance.
    • Stakeholders tend to check your work only when the conclusions are inconvenient, and to trust it when they agree with it. Have your homework ready.

    We Explored AI for

    • rapid prototyping
    • synthetic users
    • data analysis
    • automating review/extraction from recorded calls
    • LLM as judge for design evaluation
    • writing for UX
    • general writing
    • perf reviews
    • making historical research accessible.

    Not Explored

    We did not explore uses where we already had reliable and effective automation, or where human oversight was absolutely required (the recruiting/compensation pipeline).

    Where LLMs really helped: 

    1. Writing for UX. Designers are often tasked with writing key bits of text in the UI. As a group, designers do not write well. This was a big win for new products/features.
    2. Rapid prototyping. For us, about 1 in 10 studies demanded low-visual-fidelity, highly interactive prototypes with domain-accurate content. Figma sucks at that, particularly tabular data and data visualizations. Vibe coding really helped there. This benefited new/early ideas more than shipped features in refinement or optimization, but when it made the difference, it really made a difference.
    3. Spreadsheet like a pro. 
      There was always some project that required fusing different data, even for things like survey data fused to SFDC outputs. Advanced spreadsheet work is a kind of coding. The research team can do a little stats; we’re perceptive observers and good interviewers, but not everyone on the team can own spreadsheets like a financial analyst. This helped a lot with a small number of critical projects (see the data-fusion sketch after this list).
    4. Tone on general comms.
      We used ‘LLM as judge’ on communications. For critical written company communications and things like performance reviews, LLMs definitely helped avoid gotchas: they didn’t improve speed or productivity, but they made sure poorly chosen words didn’t damage morale or cause defensiveness.
    5. Making historical research accessible 
      We tried a few approaches to this, starting with adding research reports to NotebookLM. In the end we found more consistent results by translating research reports to markdown and bullet-pointing them for the LLM. Getting the organization to actually go and ask was an institutional barrier we worked to overcome: PMs still preferred to use the researcher on the team as a human encyclopedia in meetings, and senior leaders, when they chose to ask, approached me directly.
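
    For point 3 above, here is a minimal sketch of the kind of data fusion involved, written in Python/pandas rather than spreadsheet formulas. The file names, column names, and join key are hypothetical; the real work happened in spreadsheets, with the LLM drafting the formulas and glue scripts.

```python
# Minimal sketch (hypothetical files/columns) of fusing survey responses with an SFDC export.
import pandas as pd

# Survey tool export: one row per respondent.
survey = pd.read_csv("survey_responses.csv")   # e.g. columns: email, nps, role
# SFDC report export: one row per contact, with account metadata.
sfdc = pd.read_csv("sfdc_contacts.csv")        # e.g. columns: email, account_name, segment

# Normalize the join key before merging to avoid silent mismatches.
survey["email"] = survey["email"].str.strip().str.lower()
sfdc["email"] = sfdc["email"].str.strip().str.lower()

# Left join keeps every survey response, even if the contact is missing in SFDC.
merged = survey.merge(sfdc, on="email", how="left", indicator=True)

# Flag unmatched respondents for manual review rather than dropping them quietly.
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} survey responses had no SFDC match")

# Example rollup: response counts and mean NPS by customer segment.
print(merged.groupby("segment", dropna=False)["nps"].agg(["count", "mean"]))
```

    The win was less the merge itself than having an LLM draft and explain this kind of glue code for teammates who don’t write it every day.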

    Shows some promise, but not there for us

    While applying just a foundation model might not work, there is much to try with hybrid approaches.

    1. Synthetic users for ‘attitudinal/behavioral’ questions
       (e.g. ‘what are your biggest challenges with X?’). Again, this is a mixed bag depending on how much underlying high-quality data we had. It felt like we never got to critical mass: it works great when, say, a newly hired PM or designer asks an ‘old’ question where established data exists. We used a pseudo-RAG approach (.md files accessed via a Cursor prompt against different models; see the pseudo-RAG sketch after this list). It wasn’t as good as our control (‘ask a researcher through Slack’) in terms of quality, but response time was better.
    2. Getting insight from interview transcripts. Ideally automated insight.
      This is another one we really wanted to work, especially in an automated way. We had a lot of recorded conversations with customers, well beyond research sessions, and both the services team and ours would love to get data off of them without the time cost of manual review. So we wrote some custom prompts and also used Dovetail’s evolving toolset (see the transcript-prompt sketch after this list). We saw improvement, but not enough maturity to save time: extracted data commonly confused speakers (customer vs. us), generalized away details, and depended on high-quality transcripts. Bio research is jargon heavy, with researchers from all over the world who have varying command of English. It wouldn’t save us enough time while still delivering accurate insight.
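
    For point 1 above, the ‘pseudo-RAG’ setup can be approximated in a few lines: keep findings as markdown bullets, rank files by naive keyword overlap with the question, and paste the top matches into the model’s context. This is a hypothetical sketch, not the exact tooling we used (we mostly drove it through Cursor); the directory name and scoring are illustrative.

```python
# Hypothetical sketch of a "pseudo-RAG" pass over markdown research notes:
# rank .md files by naive keyword overlap with the question, then build a prompt
# containing the top matches for a chat model.
from pathlib import Path

def score(question: str, text: str) -> int:
    """Count how many question words appear in the document (very naive retrieval)."""
    words = {w.lower() for w in question.split() if len(w) > 3}
    lowered = text.lower()
    return sum(1 for w in words if w in lowered)

def build_prompt(question: str, notes_dir: str = "research_notes", top_k: int = 3) -> str:
    docs = [(p, p.read_text(encoding="utf-8")) for p in Path(notes_dir).glob("*.md")]
    ranked = sorted(docs, key=lambda d: score(question, d[1]), reverse=True)[:top_k]
    context = "\n\n".join(f"## {p.name}\n{text}" for p, text in ranked)
    return (
        "Answer the question using ONLY the research notes below. "
        "If the notes don't cover it, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

if __name__ == "__main__":
    print(build_prompt("What are your biggest challenges with antibody selection?"))
```

    Response time beat ‘ask a researcher through Slack’; answer quality did not.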
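
    For point 2, a hypothetical sketch of the kind of custom extraction prompt involved. The client, model name, and transcript are illustrative; the details that seemed to matter were explicit speaker labels, verbatim quotes with attribution, and permission to answer ‘not present’ instead of guessing.

```python
# Hypothetical sketch of a transcript-extraction prompt (model and transcript are illustrative).
from openai import OpenAI  # any chat-completion client would do

client = OpenAI()

TRANSCRIPT = """\
[CUSTOMER - Dr. A, preclinical scientist]: We mostly struggle with antibody validation data...
[BENCHSCI - researcher]: Can you walk me through the last time that happened?
[CUSTOMER - Dr. A]: Sure, last month we ordered three clones and only one worked...
"""

PROMPT = f"""You are extracting findings from a customer call transcript.
Rules:
- Attribute every point to the labeled speaker; never merge CUSTOMER and BENCHSCI statements.
- Quote verbatim where possible; do not generalize away specifics (reagents, numbers, tools).
- If the transcript does not support a finding, write "not present" rather than guessing.

Transcript:
{TRANSCRIPT}

Output: a bullet list of (speaker, finding, verbatim quote)."""

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "user", "content": PROMPT}],
)
print(response.choices[0].message.content)
```

    Even with prompts along these lines, outputs still confused speakers and dropped domain specifics often enough that manual review remained necessary.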

    Unhelpful with the state of the art in 2025

    We saw a good chance these outputs would drive confidently bad decisions. I’m not confident that base foundation models will improve here; it may require custom or external evaluators.

    1. LLM as heuristic judge for basic usability issues. 
      We really wanted this to work; it’s a subcategory of synthetic users. In early designs you often find a lot of issues with the ‘basics,’ i.e. well-established heuristics. We tried a mix of custom prompts, submitting single screens (see the sketch after this list). The hope was to increase speed in a similar way to unmoderated testing, but with somewhat better precision and a lot more efficiency. The LLM tended to miss visual and behavioral issues, focusing on the words in the screens or screen images. For the big model vendors this is a small-dollar-value problem, but it feels like it needs big-model energy to solve.
        
    2. General writing of reports for humans
      Through trial and error we found the most effective reports for our company tend to be highly visual with very few words (also memes), accompanied by a short Slack summary. LLMs are fluent writers, but their summaries tended to miss the points we were trying to make.
    3. Study design. In general it didn’t save time or improve quality.
    4. Agentic interface design. Compared to 20 years ago, a lot of design today is not that experimental. This might actually be both possible and effective with the right models and really mature UX component systems to preserve visual and behavioral consistency (Shopify has these building blocks, but others not so much). It’s one area where non-determinism in models could be a huge plus for ideation. I do think it requires augmenting foundation models with models trained specifically around design concerns. I don’t think Figma or Canva will get there, because at heart they still operate on well-grouped, partially parameterized vector drawings, not the way an engineered component works, and without knowledge of core cognitive psychology.
    5. Generating personas or foundational research with just a foundation model search.
       When asking foundation models, we got surprisingly accurate reads on how preclinical pharmaceutical research works at a high level. When asking them to create personas, the form was amazing but the content was only 50% to 85% accurate compared to our own data. I would have big concerns that PMs and designers using this output would make consistently bad decisions with confidence.
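
    For point 1 in this list, the heuristic-judge experiments looked roughly like the sketch below: one screenshot at a time plus a prompt keyed to Nielsen-style heuristics. The prompt wording, model, and file name are illustrative. The failure mode we saw was that the model reasoned mostly about the words on the screen rather than layout, affordances, or interaction behavior.

```python
# Hypothetical sketch of "LLM as heuristic judge" for a single screen.
import base64
from openai import OpenAI

client = OpenAI()

HEURISTIC_PROMPT = """Act as a usability reviewer applying Nielsen's 10 heuristics.
For the attached screen, list concrete violations as (heuristic, element, why it is a problem, severity 1-4).
Focus on layout, affordances, and likely interaction behavior, not just the copy.
If you cannot judge something from a static image, say so explicitly."""

with open("design_screen.png", "rb") as f:  # hypothetical screenshot export
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": HEURISTIC_PROMPT},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```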

    One final thought: Don’t let AI isolate you from customers.

    Any time you put a layer between PMs and users, it’s a fail. Face-to-face interviews drive alignment far more than any research report. In my opinion, listening to and observing real users and customers directly and frequently makes the biggest difference in product quality and fit. The more isolated PMs, execs, and designers are from customers, the worse their decisions become. Many AI tools for research position themselves as time-savers, automating everything from study design to final reports, and they typically pitch to the harried PM or the research manager.

    As it stands, execs don’t often interact with customers. Bad PMs certainly don’t.

    Apologies for the long rant, 
    Jeremy