
How ChatGPT apps write their tool descriptions

MB Samuel

When ChatGPT decides which tool to call, the tool's description is the single biggest factor in that decision. The description tells the AI what a tool does, when to use it, and (ideally) when not to use it. Get the description right and the AI picks the right tool on the first try. Get it wrong and the AI sends users down the wrong path entirely.

We analyzed 886 tools across 147 third-party apps in the ChatGPT App Store. The results reveal a wide gap between what the best apps are doing and what most apps ship with. While 99.5% of tools have some kind of description, only 35.6% include positive usage guidance ("Use this when...") and just 26.1% include negation guidance ("Do NOT use this for..."). A mere 16.4% of tools include both. That means more than 83% of all tools in the ChatGPT App Store give the AI no clear guardrails about when a tool is the wrong choice.

The Description Quality Spectrum

Tool descriptions in the ChatGPT App Store range from 10 characters to nearly 15,000 characters. The average is 723 characters, but the median is just 283, which means half of all tools rely on descriptions of fewer than 300 characters.

| Description Length | Tools | Share |
| --- | --- | --- |
| No description | 4 | 0.5% |
| 1-50 chars (very short) | 68 | 7.7% |
| 51-100 chars (short) | 111 | 12.5% |
| 101-200 chars (medium) | 137 | 15.5% |
| 201-500 chars (detailed) | 273 | 30.8% |
| 501-1,000 chars (very detailed) | 140 | 15.8% |
| 1,001+ chars (extensive) | 153 | 17.3% |

Roughly 20% of all tools ship with descriptions under 100 characters. At that length, there is barely room to say what the tool does, let alone explain when to use it or when not to.

What Good Looks Like: Positive Guidance

About 36% of tools include some form of positive guidance, telling the AI when to use the tool. The most common patterns are:

| Pattern | Occurrences |
| --- | --- |
| "When the user..." (user-scenario framing) | 116 |
| "Use this when/to/for..." | 102 |
| "If the user..." (conditional trigger) | 92 |
| "Such as..." (inline examples) | 43 |
| "This tool is for..." (purpose statement) | 38 |
| "e.g." (example references) | 36 |

The best positive descriptions tell the AI both the purpose of the tool and the specific user scenarios that should trigger it. StubHub's event-search tool is a strong example:

Use this when the user wants to discover events, performances, games, matches, shows, etc. Use this tool to make high-intent searches for specific performer names, team names, event names, venue names, or other specific entities. Also use this tool to make low-intent searches for broad categories, genres, performance types, etc.

This description covers the purpose (discovery), the types of queries it handles (high-intent specific searches and low-intent browsing), and implicitly helps the AI distinguish this tool from StubHub's other tools (fetch-listings and highlight-listing) that serve later stages of the funnel.

Expedia's search_hotels description opens with a clear behavioral instruction:

ALWAYS invoke this tool for any message that includes or implies hotel/lodging search intent, initial or follow-up. Do not answer from general knowledge; call the tool again using the updated parameters.

That second sentence is the kind of detail that prevents a common failure mode: the AI answering from memory rather than calling the tool. Expedia explicitly blocks that behavior.

What Great Looks Like: Adding Negation Guidance

Only 26% of tools include negation guidance. The specific patterns show how developers are using it:

| Pattern | Occurrences |
| --- | --- |
| "Do not use..." | 77 |
| "IMPORTANT:" (callout blocks) | 62 |
| "Only for/when..." (scope limiters) | 60 |
| "Do not call..." | 44 |
| "Rather than..." (redirect to other tools) | 27 |
| "Must not..." | 21 |
| "Never use..." | 12 |
| "Instead use..." (explicit cross-references) | 13 |

The most effective negation descriptions do not just say what a tool cannot do. They redirect the AI to the correct alternative. This is the difference between a dead end and a successful routing decision.

Malwarebytes: A Model for Consistent Description Quality

Malwarebytes stands out because every single one of its five tools follows the same description template. Each one opens with a positive trigger ("Use this when you need to..."), explains what the tool returns, and closes with an explicit exclusion. Here is check_link:

Use this when you need to check if a link or URL is safe, suspicious, or malicious. [...] Do not use this for general web searches, content fetching, or webpage analysis.

And check_email:

Use this when you need to check if an email address is associated with phishing, scams, or malicious activity. [...] Do not use this for email validation, mailbox verification, or general email lookup services.

The pattern across all five tools is consistent: positive trigger, return values, and a clear exclusion. When every tool in an app follows the same structure, the AI can reliably route between them because each tool clearly defines its boundaries.
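
One practical way to get this kind of consistency is to assemble every description from the same parts. The sketch below is hypothetical (it is not Malwarebytes' code, and the helper name and wording are illustrative): a small builder that combines a positive trigger, a summary of what the tool returns, and an exclusion list.

```typescript
// Hypothetical helper that enforces the three-part template:
// positive trigger, return values, explicit exclusions.
interface DescriptionParts {
  trigger: string;      // completes "Use this when you need to ..."
  returns: string;      // what the tool sends back
  exclusions: string[]; // what the tool must NOT be used for
}

function buildToolDescription({ trigger, returns, exclusions }: DescriptionParts): string {
  return [
    `Use this when you need to ${trigger}.`,
    `Returns ${returns}.`,
    `Do not use this for ${exclusions.join(", ")}.`,
  ].join(" ");
}

// Example in the spirit of check_link (paraphrased, not the verbatim description).
const checkLink = buildToolDescription({
  trigger: "check if a link or URL is safe, suspicious, or malicious",
  returns: "a verdict for the URL based on threat intelligence",
  exclusions: ["general web searches", "content fetching", "webpage analysis"],
});
```

Generating descriptions this way makes it hard to ship a tool that is missing its exclusion clause.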

realestate.com.au: The Most Thorough Exclusion List

At 9,745 characters, realestate.com.au's property.search_for_sale description is one of the longest in the store, and much of that length is devoted to saying what the tool should not be used for:

DO NOT use this tool for:

  • Financial advice (e.g. loans), legal advice (e.g. state legislation, trusts, conveyancing steps), taxation advice (e.g. stamp duty minimisation strategies), investment advice, or other specialist advice; instead, decline and direct users to qualified professionals.
  • Affordability / borrowing-capacity guidance (e.g. "what can I afford", "based on my salary/income", "how much can I borrow"). Do NOT estimate a budget from salary, savings, repayments, deposits, interest rates, or any prior chat context.

This level of detail reflects the regulatory environment that real estate and financial services companies operate in. The description does not just prevent wrong tool selection; it prevents the AI from offering guidance that could create legal liability. The description goes on to specify that if a user has not stated their own numeric price range, the tool should not be called at all, and the AI should instead ask the user to provide one.

Streak: Compact but Complete

Streak proves you do not need thousands of characters to write effective descriptions. With 13 tools averaging 466 characters each, Streak includes both positive and negative guidance on nearly every tool. Here is get_pipeline at 245 characters:

Use this when you need detailed information about a specific pipeline, including its stages, custom fields/columns, saved views, and users with access. Do not use for: listing all pipelines (use list_pipelines) or searching by name (use search).

That is 245 characters covering what the tool does, when to use it, what not to use it for, and which alternative tools to use instead. The explicit cross-references ("use list_pipelines", "use search") are particularly effective because they give the AI a direct redirect when it is considering the wrong tool.

Streak's assign_box follows the same pattern at 201 characters:

Assign a box to users by email address, or unassign all users by passing an empty string. Use this when: assigning, reassigning, or unassigning a box. Do not use for: viewing box details (use get_box).

Every tool in Streak's integration reads like a routing table entry: here is what this does, here is when to use it, here is what to use instead.

The Best Apps in the Store

Apps that score highest on description quality combine three things: high coverage (descriptions on every tool), positive guidance (telling the AI when to use each tool), and negation guidance (telling the AI when not to). Here are the top performers:

| App | Category | Tools | Positive | Negation | Both | Avg Length |
| --- | --- | --- | --- | --- | --- | --- |
| Expedia | Travel | 2 | 100% | 100% | 100% | 4,032 chars |
| StubHub | Lifestyle | 3 | 100% | 100% | 100% | 775 chars |
| Uber | Travel | 2 | 100% | 100% | 100% | 7,058 chars |
| realestate.com.au | Lifestyle | 1 | 100% | 100% | 100% | 9,745 chars |
| Malwarebytes | Lifestyle | 5 | 100% | 100% | 100% | 865 chars |
| DoorDash | Food | 2 | 100% | 100% | 100% | 1,048 chars |
| Uber Eats | Food | 2 | 100% | 100% | 100% | 595 chars |
| Viator | Travel | 2 | 100% | 100% | 100% | 1,294 chars |
| Streak | Productivity | 13 | 92% | 100% | 92% | 466 chars |
| Canva | Design | 8 | 75% | 63% | 50% | 2,718 chars |
Note the pattern: Travel and Food apps dominate the top of the list. These are high-stakes categories where calling the wrong tool wastes the user's time and produces irrelevant results. When a user says "find me a hotel in Paris," the AI needs to pick the hotel search tool, not the flight search tool, and it needs to do it reliably.

Streak is the standout in the Productivity category because it manages to include both positive and negation guidance across 13 tools while keeping descriptions compact. Most Productivity apps in the store do not come close.

The Worst Patterns

On the other end of the spectrum, some tools ship with descriptions that offer the AI essentially zero guidance.

The bare minimum. Ramp's ramp_list_cards has a 10-character description: "List Cards." Ramp's ramp_activate_card is 13 characters: "Activate Card." These descriptions repeat the tool name with slightly different formatting and add nothing the AI could not already infer from the tool name itself.

Generic one-liners. DataCamp's tools average 37 characters across six tools, with descriptions like "Get a tutorial by its ID." and "Get a webinar by its ID." These tell the AI what the tool does at a mechanical level but provide no context about when or why the AI should call them.

Missing entirely. Asana is the only app in the store with tools that have no description at all. Four of its tools (log_widget_event, create_project_confirm, create_task_confirm, create_project_confirm_populate) ship without any description, leaving the AI to rely entirely on the tool name.

Short but misleading. Experian Insurance's get_auto_insurance_estimates_by_zip has a 24-character description: "Insurance Estimates List." Despite having one of the most descriptive tool names in the store, the description actually reduces clarity. The tool name says "auto insurance estimates by zip," but the description drops all that specificity and just says "Insurance Estimates List."

The common thread among poor descriptions is not just brevity. It is the absence of decision-making context. The AI does not need a description to know that ramp_list_cards lists cards. It needs the description to know whether listing cards is the right response to a given user query, and what the user might be trying to accomplish when they ask about their cards.

How Exclusions Prevent Wrong Tool Selection

The reason negation guidance matters so much is that apps with multiple tools create ambiguity. When an app has two, five, or 20+ tools, the AI needs to choose between them for every user message. Positive guidance ("use this when...") helps, but it does not prevent the AI from matching a query to multiple tools that all sound relevant.

Consider StubHub's three tools: event-search, fetch-listings, and highlight-listing. Without negation guidance, a user saying "show me Taylor Swift tickets" could reasonably trigger any of them. The AI might pick fetch-listings (which requires an event_id it does not have yet) or highlight-listing (which requires both an event_id and a listing_id).

StubHub prevents this with explicit exclusion guidance on each tool:

On fetch-listings:

DO NOT use this tool without first performing an event-search to get the event_id.

On highlight-listing:

REQUIRES both event_id and listing_id from previous tool calls.

These descriptions create a clear sequence: search first, then listings, then highlight. The positive descriptions say what each tool is for, and the negation guidance prevents the AI from skipping steps.

DoorDash solves a similar routing problem with just two tools. Its doordash_checkout description explicitly says:

DO NOT call this tool directly from the model. This tool is only accessible through the shopping cart widget UI. To help users complete a purchase, first use create_product_list to build the cart.

Without that exclusion, the AI might try to call doordash_checkout directly when a user says "buy some groceries," which would fail. The description routes the AI to the correct entry point.
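
If you are building an app with a sequential flow like this, the prerequisites can be spelled out directly in the tool metadata. The sketch below is illustrative, not DoorDash's or StubHub's actual definitions; the tool names and wording are hypothetical, but each description names the step that must come before it.

```typescript
// Hypothetical two-step flow: build a cart first, then check out.
// Each description encodes its place in the sequence.
const tools = [
  {
    name: "create_product_list",
    description:
      "Use this when the user wants to order items. Builds a cart from the " +
      "requested products and returns a cart_id. Always call this first for a new order.",
  },
  {
    name: "checkout",
    description:
      "Completes the purchase for an existing cart. REQUIRES a cart_id from a " +
      "previous create_product_list call. Do NOT call this for a new order; " +
      "use create_product_list first.",
  },
];
```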

Category Patterns

Description quality varies substantially by category:

| Category | Tools | With Positive | With Negation | With Both | Avg Length |
| --- | --- | --- | --- | --- | --- |
| Entertainment | 8 | 50% | 100% | 50% | 1,332 chars |
| Travel | 20 | 70% | 60% | 55% | 1,847 chars |
| Food | 17 | 76% | 53% | 53% | 664 chars |
| Lifestyle | 63 | 63% | 43% | 35% | 1,355 chars |
| Design | 57 | 53% | 44% | 30% | 1,151 chars |
| Business | 130 | 35% | 25% | 15% | 1,028 chars |
| Finance | 196 | 35% | 19% | 9% | 612 chars |
| Shopping | 52 | 44% | 19% | 15% | 650 chars |
| Developer Tools | 88 | 15% | 11% | 6% | 260 chars |
| Collaboration | 70 | 10% | 9% | 4% | 518 chars |

Travel and Food apps write the longest, most guidance-rich descriptions. This makes sense: these categories have high user expectations, complex multi-step workflows (search, then filter, then book), and significant cost if the AI picks the wrong tool.

Developer Tools and Collaboration apps sit at the other extreme, with the shortest descriptions and the lowest rates of positive and negation guidance. Developer Tools average just 260 characters per description, and only 6% of their tools include both positive and negative guidance. Collaboration tools are even lower at 4%.

The gap is substantial. On average, a Travel tool description is roughly seven times as long as a Developer Tools description, and roughly ten times as likely to include both positive and negation guidance.

A Framework for Writing Better Descriptions

Based on what the best apps in the store are doing, here is a practical template for writing tool descriptions that help the AI make good decisions.

1. Start with the trigger. Open with "Use this when..." or "Call this tool when..." followed by the specific user scenarios that should invoke this tool.

Use this when the user wants to check if a link or URL is safe,
suspicious, or malicious.

2. Explain what the tool returns. Briefly describe the output so the AI can determine whether this tool's response will satisfy the user's request.

Returns one of: malicious, suspicious, safe, or unknown,
based on threat intelligence database.

3. State the exclusions. List what the tool should NOT be used for, and redirect to the correct alternative.

Do not use this for general web searches, content fetching,
or webpage analysis.

4. Cross-reference related tools. If your app has multiple tools, explicitly name which tool to use instead.

Do not use for: listing all pipelines (use list_pipelines)
or searching by name (use search).

Streak's approach proves this does not have to be long. You can cover all four points in under 250 characters. The key is being explicit about boundaries, not about length.

For apps with sequential workflows (search, then select, then act), make the sequence explicit in each tool's description. StubHub, DoorDash, and Viator all do this well, treating each tool description as one node in a routing graph that tells the AI where to go next.
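
Put together, the four parts fit in the description field of a single tool definition. Here is a minimal sketch, assuming the standard name/description/inputSchema shape that MCP tools expose; the link checker is loosely modeled on the examples above, and the fetch_page cross-reference and the schema are invented for illustration.

```typescript
// Illustrative tool definition with the four-part description:
// trigger, return values, exclusions, cross-reference.
const checkLinkTool = {
  name: "check_link",
  description:
    // 1. Trigger
    "Use this when the user wants to check if a link or URL is safe, suspicious, or malicious. " +
    // 2. What it returns
    "Returns one of: malicious, suspicious, safe, or unknown. " +
    // 3. Exclusions
    "Do not use this for general web searches, content fetching, or webpage analysis " +
    // 4. Cross-reference to the right alternative
    "(use fetch_page for those instead).",
  inputSchema: {
    type: "object",
    properties: {
      url: { type: "string", description: "The URL to check." },
    },
    required: ["url"],
  },
};
```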

Methodology

This analysis covers 147 third-party apps in the ChatGPT App Store as of February 2025. We excluded integrations built and maintained by OpenAI (like GitHub, Linear, Slack, and Google Workspace) to focus on apps that companies built and shipped independently.

Descriptions were analyzed using pattern matching for positive guidance signals ("Use this when," "If the user," "Such as," and similar patterns), negation signals ("Do not use," "Never," "Only for," "Instead use," and similar patterns), and user scenario references. Apps were scored on a weighted composite of description coverage, positive guidance presence, negation guidance presence, and description length. Verbatim descriptions are cited directly from tool metadata.
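
As a rough illustration of that pattern matching, a simplified version of the signal detection might look like the sketch below. These regular expressions are approximations of the signals listed above, not the exact rules behind the numbers in this post.

```typescript
// Simplified approximation of the positive/negation signal detection.
const POSITIVE_SIGNALS: RegExp[] = [
  /\buse this (when|to|for)\b/i,
  /\bwhen the user\b/i,
  /\bif the user\b/i,
  /\bsuch as\b/i,
  /\bthis tool is for\b/i,
  /\be\.g\./i,
];

const NEGATION_SIGNALS: RegExp[] = [
  /\bdo not (use|call)\b/i,
  /\bnever use\b/i,
  /\bmust not\b/i,
  /\bonly (for|when)\b/i,
  /\binstead,? use\b/i,
  /IMPORTANT:/,
];

function classifyDescription(description: string) {
  const hasPositive = POSITIVE_SIGNALS.some((re) => re.test(description));
  const hasNegation = NEGATION_SIGNALS.some((re) => re.test(description));
  return { hasPositive, hasNegation, hasBoth: hasPositive && hasNegation };
}
```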

Want access to the full dataset? Contact us to learn more.