EMU Research

Generative Artificial Intelligence in Research

What is Generative Artificial Intelligence?

Generative Artificial Intelligence (AI) is a blanket term for any technology that creates content (text, images, audio, code, etc.) based on parameters or guidelines (prompts) provided by a user.

Generative AI is not new; it has existed in rudimentary forms for decades. What is new is the speed and flexibility of current models. Newer forms of Generative AI improve as they are trained on more data, and their output becomes increasingly sophisticated.

Using Generative AI

There are many different types of Generative AI. A brief description of each type, with guidelines for its ethical use, follows.

Large Language Models (LLMs)

LLMs, such as ChatGPT, generate text based on the instructions in the prompt. Most LLMs are trained on millions of documents and generate text one word (more precisely, one token) at a time. At each step, the program uses patterns learned from its training data to estimate the probability of each possible next word given the text so far, and then selects a likely one, in the simplest case the word with the highest probability.
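The idea of choosing the next word by probability can be illustrated with a deliberately tiny sketch. It is a toy model that counts which word follows which in a few sample sentences and always picks the most frequent follower. Real LLMs use neural networks, consider far more than the single previous word, and often sample rather than always taking the top choice. The function names and training sentences are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_counts(documents):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for doc in documents:
        words = doc.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probabilities(counts, prev_word):
    """Turn follower counts for prev_word into probabilities."""
    followers = counts.get(prev_word.lower())
    if not followers:
        return {}  # the word never appeared with a follower in training
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}


def generate(counts, start_word, max_words=5):
    """Build text word by word, always taking the most probable next word."""
    words = [start_word.lower()]
    for _ in range(max_words - 1):
        probs = next_word_probabilities(counts, words[-1])
        if not probs:
            break
        words.append(max(probs, key=probs.get))
    return " ".join(words)


# Invented training data, for illustration only.
training_documents = [
    "the study was approved",
    "the study was reviewed",
    "the study was approved by the board",
]
counts = train_bigram_counts(training_documents)

print(next_word_probabilities(counts, "was"))
print(generate(counts, "the", max_words=4))  # prints "the study was approved"
```

Because "approved" follows "was" more often than "reviewed" in the sample sentences, the toy model always continues with "approved". This also shows why output mirrors the training data: whatever is most common in the data is what the model produces, whether or not it is true.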

Some LLM programs (e.g., ChatGPT) are provided with a large set of collected publications to use as training data. Because the training data are not continuously updated, the output text may contain outdated information. Other LLM programs (e.g., Bing) also scan websites, which are continuously added and updated, and can therefore provide more recent results. However, websites are rarely checked for veracity, so the information drawn from them may be neither true nor accurate. Training data used for any LLM may also contain explicit or implicit biases. It is important to understand the types of training data used by each LLM program before deciding which one to use. LLMs are also notorious for fabricating information that is not true; these fabrications are called "hallucinations." For example, cited research and citations are sometimes incorrect. It is critical to review the output of any LLM to verify that all information in the output is correct and free of bias.

Image Generation Software

Image generation software uses an immense database of pictures with their captions as training data. The images used as training data are often collected from the internet, in particular from websites in Western countries. As a result, the images created by image generation AI reflect the biases inherent in what is published on the internet.

As with LLMs, it is important to be aware of the training data used by the software prior to using AI to generate images. Unfortunately, most software companies do not publicly disclose their training data. In the absence of known training data, the user must closely examine the output images to ensure that they are free of any explicit or implicit bias.

Generative AI-based Apps Used for Research Interventions

Generative AI can also be used as a research intervention. One example is a chatbot that is programmed to check in with participants during the study to make sure they are following the intervention protocol and to address any issues with participation. If Generative AI is used as a research intervention, there must be human oversight to make sure that the advice and responses provided by the AI are ethical, free of bias, and not harmful to participants. For example, if a researcher is studying the use of AI in providing individualized dietary information to subjects with type 2 diabetes, it is the researcher's responsibility to ensure that the information provided by the AI is consistent with medical guidelines, that the AI is not producing hallucinations, and that the advice is free of potential implicit bias. These studies can be ethically complex, and it is highly recommended that additional procedures be put in place to protect participants.

Responsible Use of Generative AI

Generative AI can be a useful tool when conducting research. As with any tool, it is important to use it ethically. In addition to providing transparency, detailed reporting of Generative AI use will enhance the reproducibility of the work.

Always review output for veracity, accuracy, and bias.
Honestly report the following:
- Generative AI program used
- How it was used
- In which parts of the study and manuscript it was used
- The prompts used, word for word.
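One way to keep these details together while the work is in progress is a simple structured record. The sketch below is only an illustration of the reporting items listed above. The class name, field names, and example values are hypothetical and do not represent a required or standard format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class GenerativeAIUseRecord:
    """Hypothetical record of one use of Generative AI in a study."""
    program: str                  # Generative AI program used
    how_used: str                 # how it was used
    sections: List[str] = field(default_factory=list)  # parts of the study or manuscript
    prompts: List[str] = field(default_factory=list)   # prompts, word for word

    def add_prompt(self, prompt: str) -> None:
        # Store the prompt exactly as typed, without trimming or rewording.
        self.prompts.append(prompt)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Example values are invented for illustration.
record = GenerativeAIUseRecord(
    program="ExampleLLM (hypothetical program name)",
    how_used="Drafted a first-pass summary that the authors verified and edited",
    sections=["Introduction"],
)
record.add_prompt("Provide a 200 word summary of the following text.")

print(record.to_json())
```

Saving such a record alongside the study files makes it easier to report the prompts word for word when the manuscript is written.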

Areas of Caution when Using Generative AI

Bias

While the algorithms used in Generative AI are not themselves biased, the data that the algorithms use for training are created by people and may contain opinions, interpretations, and biases in some form. These opinions, interpretations, and biases can then be preserved in the output. Responsible use of Generative AI involves learning as much as possible about the types and sources of training data used and reviewing the output very carefully to ensure that it is as objective as possible. Unfortunately, most companies do not make the training data publicly available, so researchers must do the best they can and closely examine the output. As the saying goes, "garbage in, garbage out."

Research Integrity Implications

Plagiarism. There has been a robust conversation around whether or not the use of Generative AI constitutes plagiarism. Plagiarism is the appropriation of another's work or ideas without giving appropriate credit. Because Generative AI is an algorithm and does not have ideas, it is open to interpretation whether or not Generative AI can be plagiarized. However, Generative AI does create content and should always be cited or otherwise acknowledged for its role. Again, it is critical for the researcher to review the output of LLMs to safeguard against the possibility of the output containing plagiarized content.

Fabrication/Falsification. A more likely problem than plagiarism is that the output of Generative AI contains fabricated information (i.e., hallucinations) or falsified information (see the example from the Washington Post above). Closely reviewing the output helps catch fabrication or falsification before the material is published or otherwise communicated.

Legal Concerns

There are potential legal implications surrounding the use of Generative AI. It is currently unclear whether the use of copyrighted material as training data and the incorporation of copyrighted material in output constitute fair use (legal) or a violation of copyright (illegal). Cases are currently making their way through the legal system.

In addition, there are confidentiality concerns. If the prompt includes uploading a document (e.g., "Provide a 200 word summary of the following text."), the uploaded document is no longer under the researcher's control: it becomes part of the Generative AI system and, depending on the program, may be incorporated into training data. Documents that contain personally identifiable information, sensitive information, confidential information, or legally protected information (e.g., trade secrets and copyrighted material), or that have restrictions on their use and sharing (e.g., grant applications or journal manuscripts provided to reviewers), should NEVER be uploaded or copied into Generative AI programs; doing so will violate confidentiality agreements and could have legal or professional consequences.

Citing Generative AI

Many style manuals have developed standards for citing Generative AI in research papers. Links to citation standards follow:

American Psychological Association (APA)

Modern Language Association (MLA)

Chicago Manual of Style