You've likely heard of ChatGPT. There are numerous articles and blogs online discussing its capabilities. If you have not, it is a chatbot from OpenAI that allows users to type in questions to which ChatGPT provides answers. The bot uses generative AI methods, including both supervised and reinforcement learning, to produce text-based responses. It is designed to be conversational, meaning it answers and often improves responses with additional interaction and dialogue. One task it performs is writing software programs.

Can it be used to write SAS® or Python code? Here at Magpie, we utilize both programming languages to build solutions that help healthcare organizations achieve their analytic objectives. Let's start with a SAS® example and see how ChatGPT performs:

"Write SAS code to determine the average age by gender using the SASHELP.CLASS dataset."

This is one of our favorite datasets in the SASHELP library. It is a small, 19-observation dataset with five variables: Name, Sex, Age, Height, and Weight. We use it to build examples for clients and staff and for general testing.

ChatGPT responded with the correct answer:

ChatGPT even provided an explanation for the code, which is useful for documenting code:

"This code will create a new dataset called age_by_sex that contains the mean age for each sex in the sashelp.class dataset. The PROC MEANS statement is used to calculate the mean of a continuous variable, and the CLASS statement is used to specify the categorical variable by which the means will be calculated (in this case, sex). The VAR statement specifies the continuous variable for which the means will be calculated (in this case, age). The OUTPUT statement is used to save the results of the PROC MEANS procedure to a new dataset (in this case, age_by_sex), and the MEAN=mean_age option specifies that the mean of the age variable should be stored in a new variable called mean_age."
"This code will calculate the mean age for each sex in the sashelp.class dataset and store the results in a new dataset called age_by_sex. You can then use this dataset to analyze the average age by sex."

Next, we asked ChatGPT to:

"Write Python code to determine the average Age by Sex using the SASHELP class dataset."

We were curious to see how ChatGPT would answer the question. Apart from a few minor tweaks, ChatGPT answered this prompt correctly. See the code and explanation provided by the bot below:

"This code will calculate the mean age for each sex in the sashelp.class dataset and print the results."
"The pandas library is used to read in the sashelp.class dataset and to perform the groupby and mean calculations. The read_sas function is used to read in the dataset from a SAS binary file (.sas7bdat). The groupby function is used to group the data by the sex variable, and the mean function is used to calculate the mean of the age variable within each group."
"This code will calculate the mean age for each sex in the sashelp.class dataset and print the results. You can then use these results to analyze the average age by sex."

The minor changes required were: 1) the first letter in the "Sex" and "Age" variables needed to be capitalized and 2) the local path to "sashelp.class.sas7bdat" needed to be added to the read_sas call.
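
For illustration only, here is a minimal sketch of the corrected pattern with both tweaks applied. It is not ChatGPT's output. The file path in the comment is a hypothetical placeholder, and the small in-line DataFrame holds made-up stand-in rows (not the real SASHELP.CLASS values) so the snippet runs without a SAS file.

```python
import pandas as pd


def mean_age_by_sex(df: pd.DataFrame) -> pd.Series:
    """Return the mean of Age within each Sex group."""
    # pandas column names are case-sensitive: "Sex" and "Age", not "sex" and "age".
    return df.groupby("Sex")["Age"].mean()


# With a real export, the frame would come from read_sas. The path below is a
# hypothetical placeholder; encoding is set so text columns load as str, not bytes.
# class_df = pd.read_sas("path/to/class.sas7bdat", format="sas7bdat", encoding="latin-1")

# Made-up stand-in rows with the same column names as SASHELP.CLASS.
class_df = pd.DataFrame(
    {
        "Name": ["Ann", "Bea", "Cal", "Dev"],
        "Sex": ["F", "F", "M", "M"],
        "Age": [12, 14, 13, 15],
    }
)

result = mean_age_by_sex(class_df)
print(result)
```

Wrapping the calculation in a function keeps the group-by logic separate from how the data is loaded, so the same call works on the stand-in frame and on a frame returned by read_sas.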

It was interesting to see ChatGPT use the read_sas method, recognizing that a SAS® dataset can be read directly into Python with pandas. We often use this method in projects where we move data back and forth between SAS® and Python - a use case we are seeing more and more often.

Next, we decided to give ChatGPT more of a challenge. Because the bot is trained on a massive amount of internet data, and many examples online use the CLASS dataset, it has had plenty of material to learn from. A made-up dataset might give the bot more trouble.

Here is the next prompt we provided ChatGPT:

"Write SAS code to determine the name of the oldest patient using dataset A with variables Patient ID, Age and dataset B with variables Patient ID, Last Name, First Name."

Below is the initial SAS code that ChatGPT provided:

This code is not correct - perhaps we found the limits of ChatGPT. First, in the PROC SORT BY statement the DESCENDING option applies to the variable that follows it, so "Age" and "descending" need to be swapped to avoid a SAS® error. Second, in the last DATA step ChatGPT applied the BY statement and FIRST.variable approach to age. This keeps the first observation within each age group: if there are multiple patients who are 99 years old and multiple patients who are 65 years old, it keeps one observation for each age. The result is one patient per distinct age, not the oldest patients in the data.
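
To see why keeping the first row of each age group falls short, here is a rough pandas analogue of that logic. It is an illustration, not a translation of ChatGPT's SAS code, and the column names and patient rows are made up for the example.

```python
import pandas as pd

# Made-up patients: two are 99 years old and two are 65.
patients = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4],
        "age": [99, 65, 99, 65],
    }
)

# Sort oldest first, as the sort step intends.
by_age = patients.sort_values("age", ascending=False, kind="stable")

# Rough analogue of BY age with FIRST.age: keep the first row of every age group.
first_per_age = by_age.drop_duplicates(subset="age", keep="first")

# What the question actually asks for: every row at the maximum age.
oldest = patients[patients["age"] == patients["age"].max()]

print(first_per_age)  # one 99-year-old and one 65-year-old
print(oldest)         # both 99-year-olds
```

The first result has one row per distinct age, so it both includes a 65-year-old and drops one of the 99-year-olds.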

Given that ChatGPT is conversational and purports to improve results with additional dialogue, we followed up on the prompt. One of the things we didn't like about the initial code suggested by ChatGPT was the number of steps required. So, we asked a very simple follow-up question:

"Is there a more efficient way to write the code?"

ChatGPT responded:

"Yes, there is a more efficient way to answer the question using a SQL join statement. Here is an example of how to do this in SAS:"
"This code will select the last name and first name of the patient with the maximum age in dataset A. It does this by performing an inner join between datasets A and B on the patient_id field, and then selecting only the row with the maximum value of age in dataset A."

Impressive - ChatGPT understood we wanted more concise code with fewer steps. It also recognized SQL via PROC SQL as an option within SAS®. The generated code used a single step, which we copied and pasted directly into SAS® and ran. SAS® produced a dataset called oldest with 3 observations: the 3 oldest patients in our sample dataset, all age 99.
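
The same join-then-filter idea can be tried outside SAS®. The sketch below is not the code ChatGPT generated. It uses Python's built-in sqlite3 module as a stand-in for PROC SQL, with hypothetical table names a and b, hypothetical column names, and made-up rows in which three patients share the maximum age of 99.

```python
import sqlite3

# In-memory database with made-up rows standing in for datasets A and B.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (patient_id INTEGER, age INTEGER);
    CREATE TABLE b (patient_id INTEGER, last_name TEXT, first_name TEXT);
    INSERT INTO a VALUES (1, 99), (2, 65), (3, 99), (4, 99), (5, 42);
    INSERT INTO b VALUES
        (1, 'Adams', 'Ann'), (2, 'Baker', 'Bob'), (3, 'Clark', 'Cy'),
        (4, 'Diaz', 'Dee'), (5, 'Evans', 'Eli');
    """
)

# Inner join on patient_id, keeping only rows whose age equals the maximum age.
oldest = conn.execute(
    """
    SELECT b.last_name, b.first_name
    FROM a
    INNER JOIN b ON a.patient_id = b.patient_id
    WHERE a.age = (SELECT MAX(age) FROM a)
    ORDER BY b.last_name
    """
).fetchall()
conn.close()

print(oldest)
```

Because the filter compares each row to the overall maximum rather than grouping by age, ties at the top are all kept.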

Our initial hypothesis was that ChatGPT would produce better Python code than SAS® code. We were surprised to find that producing a Python solution to our sample use case appeared to be more of a challenge for ChatGPT, which produced several incorrect solutions. Errors like these are likely why Stack Overflow has temporarily banned answers generated by ChatGPT. Perhaps our programming jobs are safe for the near-term future.

It is important to remember that ChatGPT is designed to engage users in a dialogue, which means asking the bot follow-up questions and rephrasing prompts. One of the mistakes that ChatGPT made in both the SAS® and Python code was selecting a group-by approach and not recognizing that two or more patients may share the oldest age. The key to solving this use case is finding the maximum age and then finding all the patients with that same age. With this key in mind, we updated our ChatGPT prompt, and it produced a working solution:

"Write Python code to filter a dataframe to identify the oldest patient age and print the name of oldest patients. Use SAS dataset A with variables Patient ID, Age and SAS dataset B with variables Patient ID, Last Name, and First Name assume there are multiple patients with the same age."

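The key described above (find the maximum age, then keep every patient at that age) can be sketched in pandas as follows. This is an illustration under stated assumptions, not the solution ChatGPT returned: the column names PatientID, Age, LastName, and FirstName and all rows are hypothetical, and in practice the two frames would be loaded with pd.read_sas.

```python
import pandas as pd


def oldest_patients(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Return the names of all patients who share the maximum age."""
    max_age = a["Age"].max()
    # Keep every patient at the maximum age, not just the first one found.
    oldest_ids = a.loc[a["Age"] == max_age, ["PatientID", "Age"]]
    # Inner join to pick up names for just those patients.
    return oldest_ids.merge(b, on="PatientID", how="inner")


# Made-up stand-ins for datasets A and B (hypothetical column names).
a = pd.DataFrame({"PatientID": [1, 2, 3], "Age": [99, 65, 99]})
b = pd.DataFrame(
    {
        "PatientID": [1, 2, 3],
        "LastName": ["Adams", "Baker", "Clark"],
        "FirstName": ["Ann", "Bob", "Cy"],
    }
)

result = oldest_patients(a, b)
print(result[["LastName", "FirstName", "Age"]].to_string(index=False))
```

Filtering before the merge keeps the join small and makes the handling of ties explicit.
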
So, can we use ChatGPT to write SAS® or Python code? The short answer is yes. However, coders need to be prepared to validate and experiment, especially as the use cases get more specific and esoteric. ChatGPT is fallible - so users need to be wary of the bot's responses. Despite this issue, we plan to experiment more with ChatGPT in our daily work. Our hypothesis is that it may provide better coding prompts and solutions than simply searching Google, or at a minimum move us toward a working solution more quickly.

Finally, when we need a break or a laugh from a long day of coding, we can have fun using ChatGPT by asking it to perform silly tasks such as:

"Write a few lines of Shakespeare about SAS and Python."
"Verse doth deceive, 'tis true, yet here I'll try
To spin a tale of SAS and Python's might,
Two rivals fierce, in fields of data war,
Their skills and tools, a wondrous sight to sight.
SAS doth reign, in mainframe halls of old,
A king in age, yet slow to adapt and change,
While Python, young and nimble, doth behold
A future bright, with ease in range.
Yet SAS is not a has-been, far from it,
For still it stands, a fortress, strong and sure,
And Python's youth doth not entail a wit
Less sharp, less able to endure.
Thus let us not in lies, truth to disguise,
But recognize each tool, in its own right, wise."

While ChatGPT may not write quite as well as Shakespeare, it produced a few lines that were good for a quiet laugh before we got back to work for the day.