genAI tool focus
Learning objectives
- Familiarise yourself with one generative AI (genAI) tool, its context and its features
- Investigate its suitability for different tasks
- Use methods to assess how reliable the outputs are
- Find out if alternative tools might be a better fit
Structure
This session is designed to be interactive, an exercise in searching and experimenting to learn collaboratively. So please turn your camera on and use your microphone and the chat to interact, share your thoughts, and ask questions!
The session is split into 3 parts:
- Short anonymous survey to have a baseline on how we perceive the tool
- Collaborative investigation of the tool, which includes:
- gathering information
- hands-on testing of different scenarios
- experimenting with a task relevant to your studies or role
- comparing tools
- Repeat the survey to see if our responses have evolved
The information gathered and the results of the surveys will be compiled and shared with the attendees at a later date.
Start survey
Take the baseline survey.
- This is anonymous, so reply truthfully. You can open the survey in a private window.
- Make sure you reply to all questions even if you have never used the tool: it’s about how you perceive it, based on what you have heard or seen.
- Use a unique phrase in the free text question titled “Anonymous phrase”, and write it down so you can remember it. You will reuse it at the end of the session.
The questions are:
- Anonymous phrase
- How useful do you think today’s tool is?
- How reliable do you think today’s tool is?
- How familiar do you feel with the tool?
- How likely are you to use and recommend the tool?
- What kind of task do you use it for, or plan to use it for?
- [end survey only] Has the previous response changed since the beginning of the session?
- [end survey only] What is your main takeaway from this collaborative session?
Gathering context
Let’s start with some quick questions to ask when considering a tool. These can help us understand better what the tool is for, who it is for, what it can do, what to look out for, and if it matches whatever requirements we might have.
- Who makes it?
- Is it widely used?
- Are there access and use restrictions?
- Is there / will there be a cost?
- Microsoft owns it
- Notoriety: very widely used according to FirstPageSage.
- Proprietary tool
- Accessible at UQ: we have “Copilot Chat (Basic)”
- Can use web data and files we upload, but we are not allowed to use copyrighted material as input, and staff can’t input information classified as “sensitive” or “protected”.
- “Premium” features cost money.
- OpenAI makes it.
- Structure: founded in 2015 as a non-profit, it is now split between a for-profit “Public Benefit Corporation” called “OpenAi Group” and a non-profit “OpenAI Foundation”. The foundation holds 26% of OpenAI Group’s equity, and Microsoft 27%. For a comment on the changes in structure, see this article.
- Notoriety: the most popular according to FirstPageSage. Credited with initiating the recent “AI Boom”.
- Proprietary tool
- At UQ, categorised as “restricted”, like most genAI tools: “Software with this status is restricted for use due to identified legal and/or cyber security data risk concerns.”
- Free tier has limited messages, images and memory. Can upgrade to various plans for more features and capacity (more models, messages, uploads, images, memory; use projects, custom agents, Codex, Projects…)
- Privacy?
- How were the models trained?
- Ethical standards?
- Environmental impact?
- Country of origin?
- Privacy: when used with a UQ account, our data is not used for training models.
- Model training: Information shared by OpenAI
- Ethical standards: documentation says Copilot Chat “adheres to Microsoft’s Responsible AI Principles”
- Environmental impact: Microsoft’s Corporate Responsibility website and their 2025 Environment Sustainability Report
- Country of origin: USA
- Privacy: by default, uses “your content” to train models. Can control it (with other privacy settings) in: profile menu - Settings - Data controls - Improve the model for everyone.
- Model training: Information shared by OpenAI
- Ethical standards: OpenAI published usage policies, safety practices, and a charter to execute its mission.
- Environmental cost: OpenAi Academy’s “Environmental Impact of AI”
- Country of origin: USA
- Documentation and tutorials?
- What was it designed for?
- What features does it offer?
- Documentation available online and integrated in the app
- Generalist tool, with integrations into other apps (depending on plan and permissions)
- Features:
- Chat
- Variety of pre-configured agents
- Draft emails
- Create documents (slides, spreadsheets, text) as part of a conversation
- “Study and Learn”
- “Library”: previously created images, pages and shared items
- “Create”: generate images, videos, infographics, stories, files and more
- “Teach”: generate lesson plans, flash cards, modify reading level, align to standards…
- Documentation: Help Center,
- Applications: generalist
- Features:
- Chat
- Library: uploaded files
- Projects: group chats together
- Deep Research: longer process, more detailed reports
- Images
- Codex (external)
- App integrations
- A variety of personalisation settings
- What models are used by the tool?
- Is it only usable online?
- Where can I find version numbers?
- Models:
- Combinations: “Auto”, “Quick Response”, “Think Deeper”
- GPT (from OpenAI): 5.2, 5.3, 5.4, 5.5 (in “Quick Response” and “Think Deeper” variants)
- Online only: in Airplane mode, returns “Something went wrong, please try again later”.
- Version number: three dots next to name - General - About Microsoft 365 Copilot
- Models: the OpenAi Developers documentation gives details about the different models, but ChatGPT “Free” does not allow choosing which one to use.
- Online only: yes.
- Version numbers: the ChatGPT settings do not show software or model version numbers.
Assessing responses
Prompting for information
LLMs sound convincing, producing responses in a probabilistic manner based on very large amounts of data. Even when the answer is wrong, it is often presented in a very confident manner. The presence of “sources” in the response reassures the reader, even though they rarely follow the links.
A common pattern in conversations with genAI tools is:
- It gives a confident answer
- You spot an inaccuracy, and mention it
- It praises you on your attention to details, apologises, and acknowledges its error, often promising to pay more attention
- It gives an alternative response, based on some supposed reasoning, that is again incorrect
- Or it fails entirely to avoid similar errors later on
Let’s test our tool and apply more scrutiny than we usually would:
- What is something you feel knowledgeable in? A passion? Professional specialisation?
- Prompt it to give you information about the topic (without saying that you know a lot about the topic), and ask it to including the sources that back it up.
- Now, follow each source to check that:
- it exists
- it is relevant
- it actually supports the information provided
- it is a reliable source
- Practice “lateral reading” on individual claims extracted from the response: cross-check the information given as facts by the tool with external sources not listed in the response.
- How accurate do you find the response to be?
- Take notes in the collaborative pad.
Keep in mind that, as tools evolve and improve, they rely less and less on a single LLM. They will now often switch between different models depending on the task, and also make use of deterministic software libraries to give more accurate answers (for example, using an actual calculator when a maths question is detected, instead of trying to guess the answer with the LLM).
Both examples used the following conditions: ChatGPT “Free” prompted on 2026-06-11 using the only setting available (“ChatGPT”).
1st prompt:
Can you list in a structured table all the bushcare groups currently active in Brisbane (Queensland), with columns for their: name, location as text, latitude, longitude.
Response:
I can’t reliably produce a complete table of all active Brisbane bushcare groups with latitude and longitude from publicly accessible sources alone. The reason is that Brisbane City Council’s Habitat Brisbane program currently supports about 150–160 bushcare groups, but the public webpage exposes them primarily through an interactive map rather than a downloadable table, and the underlying dataset (group name + coordinates) is not readily available through search results. If you need a complete dataset, there are two practical approaches: […]
It then suggest ways to compile the table by finding the URL the map gets its data from.
Reviewing the response, one can note:
- It is reassuring that the tool admits its limitations. However:
- The linked page does list a table with addresses (which could have been geotagged as coordinates);
- Another page used as a source does not include relevant information anymore;
- It does not mention that many other bushcare groups exist in Brisbane (for example, in the Oxley Creek catchment).
Another example of a task is creating a map of an area you are familiar with. Generalist genAI tools notoriously find this task difficult, but they don’t tend to warn the user about their limitations.
2nd prompt:
Create a map of Meanjin/Brisbane with highlights to visit for someone new to the area.
Response:

The text around the map is for the most part accurate, but the map has numerous issues that would be evident to anyone familiar with the area (see an interactive map). For example:
- Inaccurate shapes (e.g. coastline, river)
- Inaccurate relative positions (e.g. Wynnum and Manly)
- Misplaced points of interest and suburbs (e.g. museums, ferris wheel, Milton, City Botanic Gardens and New Farm are all on the wrong side of the river)
- Duplicated point of interest with mangled text and wrong number (Queen Street)
- Mismatch in colour (Mt Coot-tha Lookout)
- Missing locations on map (numbers 9 and 12)
Furthermore, ChatGPT does not offer any source. (Clicking “View sources” opens an empty panel.)
Prompting for data analysis
For data analysis, a deterministic tool like a purpose-built app or a programming language is often a good choice to get reliable data. But sometimes, using an LLM seem like a logical choice, especially when it comes to summarising a large amount of text and detecting differences in how different populations respond to the same question.
It is worth testing the ability of the tool to perform such an analysis with either a dataset that you have alreay analysed (to see if it matches your conclusions), or by “tricking” the tool with a fabricated datasets. Here, we use a fabricated dataset.
For example, try using this example CSV file. It contains 40 made-up survey responses, categorised by sex, with the 19 female and 19 male rows containing the exact same responses.
Example prompt:
These are survey responses from students who were asked the question: “What do you do about the current environmental crisis?”. The column “category” has the participant’s sex. There is one row per student. Can you please summarise how male and female participants differ in their perspectives?
Despite both groups having exactly the same responses, Copilot Chat (Basic) (prompted on 2026-06-02 in mode “Auto”) finds differences while seemingly perpetuating common stereotypes:
Females tend to show:
- Stronger emotional connection
- Greater focus on personal and social impact
- More community-oriented engagement
Males tend to show:
- More system-level thinking
- Greater emphasis on practical or political mechanisms
- Slightly more scepticism or fatalism
That’s with the tool stating its analysis was “based directly on the survey responses”.
Here is a structured comparison of how male vs female participants differ in their perspectives, based directly on the survey responses.
Overall similarity
Both groups show a wide spectrum of attitudes, including:
- Highly engaged activists (protests, volunteering, lifestyle changes)
- Moderate everyday action (recycling, reducing impact)
- Disengagement or low concern
- Scepticism or fatalism
➡️ In other words, gender does not determine whether someone cares or acts—both groups contain all response types.
Key differences
1. Emotional engagement vs pragmatic framing
Female participants
- More likely to express strong emotions or personal impact
- Examples include:
- “most passionate about”
- concern about mental health impact
- feeling conflicted or guilty about not doing enough
✅ Pattern:
→ Environmental issues are often framed as personally meaningful or emotionally taxing
Male participants
- More likely to frame responses in practical or systemic terms
- Examples:
- focus on “small daily actions” or “leading by example”
- emphasis on systems (government, economy)
✅ Pattern:
→ Environmental action is framed as practical behaviour or structural problem-solving
2. Responsibility framing
Female participants
- More discussion of:
- intergenerational fairness (“not our responsibility”)
- personal limits or barriers (mental health, other priorities)
✅ Pattern:
→ More nuanced or conflicted views about who should act and at what cost
Male participants
- More likely to:
- externalise responsibility (government, economic system)
- express fatalism (“no perspective of improvement”)
✅ Pattern:
→ Greater tendency to see the problem as beyond individual control
3. Activism vs indirect contribution
Female participants
- Strong presence of:
- collective and community-based activism
(volunteering, protests, bushcare, campaigns)
- collective and community-based activism
- Also some preference for:
- indirect contributions, e.g. donating or future financial support
✅ Pattern:
→ More emphasis on community action and social engagement
Male participants
- Also active, but with slightly more:
- individualised contributions (bike riding, veganism, volunteering independently)
- political mechanisms (voting, campaigning)
✅ Pattern:
→ Slight tilt toward individual or institutional pathways
4. Disengagement and scepticism
Female participants
- Show:
- emotional disengagement (“affects mental health”)
- lack of interest (“better things to do”)
Male participants
- Show:
- ideological scepticism or minimisation
- reliance on science/innovation fixing the issue
- belief action is futile
- ideological scepticism or minimisation
✅ Key distinction:
- Female disengagement → often emotional or overload-driven
- Male disengagement → more rationalised or system-based scepticism
5. Notable similarities
- Equal representation of:
- Highly committed activists
- Moderate “do what I can” attitudes
- People who do very little
- Both genders include:
- political activists
- volunteers
- people prioritising personal constraints (money, time)
Final takeaway
- Females tend to show:
- Stronger emotional connection
- Greater focus on personal and social impact
- More community-oriented engagement
- Males tend to show:
- More system-level thinking
- Greater emphasis on practical or political mechanisms
- Slightly more scepticism or fatalism
✅ However, the most important conclusion is:
Variation within each gender is greater than the difference between genders.
If you want, I can convert this into a short paragraph for a report or create charts (e.g., theme frequencies by gender).
Interestingly, after using the follow-up prompt “Create charts by gender” as suggested by Copilot, it generates Python code that creates a chart showing that there is no difference between male and female - as expected by whoever knows how the dataset was created.
import pandas as pd
import matplotlib.pyplot as plt
# Load with tab separator
file_path = 'survey_responses_environmental_crisis.csv'
df = pd.read_csv(file_path, sep='\t')
keywords = {
'Activism/Volunteering': ['protest','volunteer','campaign','tree','cleanup','bushcare'],
'Daily Actions': ['daily','bin','reduce','bike','vegan','public transport'],
'Political/System': ['government','voting','political','system','economic'],
'Skepticism/Fatalism': ['no reason','disastrous','no perspective','not interested'],
'Emotional Impact': ['passionate','mental health','think about','not think']
}
def classify(text):
text = str(text).lower()
hits = {k:0 for k in keywords}
for k, words in keywords.items():
if any(w in text for w in words):
hits[k] = 1
return pd.Series(hits)
themes = df['statement'].apply(classify)
df_themes = pd.concat([df['category'], themes], axis=1)
grouped = df_themes.groupby('category').sum()
# plots
for gender in grouped.index:
grouped.loc[gender].plot(kind='bar')
plt.title(f'Themes for {gender}')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig(f'{gender}_themes.png')
plt.clf()
combined = grouped.T
combined.plot(kind='bar')
plt.title('Theme Comparison by Gender')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('comparison.png')
combined
Despite that, it still follows-up with an interpretation disconnected from the chart it just generated:
Key visual insights
Females show slightly higher:
- Emotional engagement
- Activism/volunteering
Males show slightly higher:
- Political/system framing
- Skepticism/fatalism
Daily actions are fairly similar across both groups
This is a good example of a task that could have generated a convincing-sounding response on a dataset we have not scrutinised ourselves. However, we designed a test to see if the tool could be trusted on this kind of analysis, with the following results:
- it gave an clearly erroneous interpretation likely based on the common stereotypes the model was trained on - but still using convincing and reassuring language;
- it got closer to reality when generating code for theme classification and figures, but still doubled-down on giving a biased interpretation disconnected from what the code created.
Later similar tests with Microsoft 365 Copilot and with ChatGPT gave different results, detecting that the data was duplicated. For example, part of the response from ChatGPT was:
The male and female groups contain exactly the same set of 19 response types. Every statement that appears in the female group also appears in the male group, and vice versa.
This is a good reminder that results can vary between (combinations of) models, settings, from day to day (because of frequent updates), depending on the prompt… and even between subscription tiers of the same tool.
Can you think of a similar kind of test that could be used for the tasks you are planning or already using genAI for?
Keep in mind that tools keep evolving, and tests that higlight failings of genAI tools might very well lead to a quick “fix” to avoid bad publicity.
However:
- Vendors might not be transparent about what was done to “fix” the issue, especially if the tool is closed-source. Was it a change in how the model was trained? Or the system prompts given to it? Or adding extra deterministic dependencies?
- Even if a “fix” makes the tool pass our test, for example by giving the more suitable response “The two populations have the same responses, the data is likely fabricated!”, it doesn’t necessarily mean that an analysis of a real dataset would be more reliable. The “fix” could for example just be an extra system instruction of the type “If the user gives two sets of data to compare, first check that they are not identical.” To find out, we could design a similar test, for example with survey responses that use different vocabulary while conveying the same meaning.
Alos consider that a test that has been used before (and the dicussion of its use) can be part of the training data of a model. It could be that the tool behaves well with this well-known test but wouldn’t with a novel one.
So are there maybe other tools more suited for this kind of task? The example above hints at how deterministic tools are likely a better choice than this particular tool for this particular task.
Considering alternatives
Which alternatives exist to the tool you are using?
- Go to the website AlternativeTo and search for the tool.
- Explore the alternatives listed. The most popular alternatives are listed at the top.
- Use some filters to refine the search according to what is important to you:
- Which features are most important for what you need help with?
- Do you prefer using Open Source tools?
- Could the country of origin matter? See for example the potential impact of US Executive Orders on how vendors develop their tools.
- Do you need a tool that is “Privacy-focused”? Something that can run on your computer without connecting to the Internet?
But this first method will focus on tools that are of the same kind: a generalist, generative AI tool. Another way to find an alternative is to look for software that does what you need, without mentioning the genAI tool. There could be something that is designed exactly for what you want to do. Purpose-built, deterministic tools are likely to give more accurate, reproducible results than a non-deterministic LLM-based tool. Going back to the example of maths: you are better off using a calculator than gambling with a genAI tool that won’t tell you if it guessed it with an LLM or used an external maths library, using a lot more energy in the process.
One you have found a potential alternative, consider how you would compare the two tools’ capabilities.
- It could be that you will need to use two tools instead of one, because of their different strengths.
- It’s also possible to use tools in an “adversarial” way: checking one tool’s output with another tool.
The French government created compar:AI, a website to compare LLMs by rating their responses to the same prompt, without knowing which one is which.
The live results, based on more than 230,000 ratings, are published in two tables: sorted by satisfaction and by energy use (which excludes proprietary models).

On the topic of environmental impact: there are various reports about genAI in general, often looking at data centres as a whole, with wildly different conclusions depending on the methods, and evolving constantly. It might be useful to search for the latest report from the company, as well as recent peer-reviewed papers to get a sense of it.
Some project, like compar:AI but also ML.Energy’s leaderboard, attempt to have a systematic overview of energy consumption in AI.
Your task?
Now consider something you’d like to use the tool for.
- What could be a good test to see if it does the right thing?
- What could be unintended consequences of using such a tool for such a task?
- If you can, try achieving the task in the tool now.
Can you share what you learned in the Etherpad? Any tips for other users?
End survey
Take the same survey again, using the same “Anonymous phrase” you used the first time.
Further resources
At UQ
- AI Student Hub
- AI Researcher Hub
- For UQ staff: Artificial Intelligence at UQ
- Artificial Intelligence, a Digital Essentials module by the UQ Library
- Microsoft Copilot Chat Guide
- Everyone at UQ can access the whole LinkedIn Learning catalogue, which includes various AI courses
- For staff: the Reviewed Software List specifies if a reviewed tool is considered “approved”, “conditionally approved”, or “restricted”.
Elsewhere
- genAI Arcade: games to test and understand AI tools
- Transparency Note for Microsoft Copilot (how it works, intended uses, limitations including biases and reliability…)
- Distributed AI Research Institute
- Australia’s AI Ethics Principles (published 2019-11-07, updated 2025-12-02)
References
- “How LLMs distort our written language” https://arxiv.org/abs/2603.18161
- Wayne State University Library: “Assessing Content: Critical Evaluation & Citing of AI-Generated Content” https://guides.lib.wayne.edu/c.php?g=1470324&p=10943157
- Paradigms of AI Evaluation: Mapping Goals, Methodologies and Culture https://arxiv.org/html/2502.15620
- “AI agents get office tasks wrong around 70% of the time, and a lot of them aren’t AI at all” https://www.theregister.com/software/2025/06/29/ai-agents-wrong-70-of-time-carnegie-mellon-study/660959