An increasing number of organizations are converting longstanding SAS® code into open-source languages such as Python. The reasons for migrating vary, but whatever the rationale, each conversion project faces the challenge of “finding the right words,” so to speak, to translate SAS® into the chosen open-source language. Doing this effectively involves leaning on the experience of your team, researching documentation, and experimenting through trial and error.
During a recent conversion of the CMS Hospital Compare SAS® package into Python, a team of SAS® and Python developers unearthed a very specific difference in how SAS® and Python handle the same operation.
Specifically, consider the following SAS® code:
Here, proc fastclus is used to implement the well-known k-means clustering algorithm to assign an Overall Star Rating to each hospital. In the first call to proc fastclus, SAS® outputs the initial cluster assignment for each observation and its distance from the cluster center. In “Step 2,” proc fastclus is called a second time, and its last argument, strict=1, prompted an in-depth analysis. The experienced SAS® developers on the team researched the documentation and, leaning on prior work, determined that “strict=1” was essentially a parameter for removing outliers from the clustering: observations more than 1 standard deviation away from the minimum distance in the clusters produced by the first run of proc fastclus.
The Python conversion leaned on the well-known scikit-learn machine learning library (sklearn). The K-means implementation from sklearn (the KMeans class) is the closest Python analog to the SAS® proc fastclus.
A glance at the K-means function’s arguments shows there is no direct counterpart to “strict”. The first attempt at converting this SAS® code to Python yielded very similar results, but some differences between the SAS® and Python output remained.
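As a quick check of that observation, the sketch below lists the parameters that sklearn's KMeans accepts and runs a plain fit on synthetic data. The data, the choice of five clusters (one per star rating), and the random seed are assumptions for illustration, not values taken from the Hospital Compare package.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic one-dimensional scores standing in for hospital summary scores
# (assumption: the real input data is not shown in this article).
rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 1))

# Five clusters is an assumption (one per star rating); n_init is set
# explicitly because its default differs between sklearn versions.
baseline = KMeans(n_clusters=5, n_init=10, random_state=0)
params = baseline.get_params()

print(sorted(params))      # every constructor argument KMeans accepts
print("strict" in params)  # False: no direct counterpart to strict=
print(params["tol"])       # the default convergence tolerance

labels = baseline.fit_predict(scores)             # cluster index per observation
nearest = baseline.transform(scores).min(axis=1)  # distance to the nearest center
```

The last two lines roughly mirror what the first proc fastclus call is described as producing: a cluster assignment and a distance from the cluster center for each observation.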
With this understanding in place, and considering the analysis of the SAS® code regarding “strict=1”, the Python implementation was adjusted with the following lines of code:
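A minimal sketch of that kind of adjustment, not the team's exact code, follows. It assumes a first KMeans fit has already produced cluster centers, measures each observation's distance to its nearest center, and drops observations beyond a cutoff. The SAS® documentation describes STRICT= as a cutoff on an observation's distance to its nearest cluster seed; the helper name drop_far_observations, the synthetic data, the five clusters, and the cutoff of 1.0 (one standard deviation if the input is standardized) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def drop_far_observations(X, centers, cutoff=1.0):
    """Return (kept_rows, keep_mask) for rows within `cutoff` of a center.

    Hypothetical helper: it mimics the effect described for strict=1 by
    excluding observations whose Euclidean distance to the nearest
    center exceeds `cutoff`. X and centers must be 2-D arrays.
    """
    # Pairwise distances, shape (n_samples, n_clusters).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    keep_mask = dists.min(axis=1) <= cutoff
    return X[keep_mask], keep_mask


# Synthetic standardized scores (assumption: real data is not shown here).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))

# First pass locates provisional centers; the filter then uses them.
first_pass = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
X_kept, keep_mask = drop_far_observations(X, first_pass.cluster_centers_)
print(f"kept {keep_mask.sum()} of {len(X)} observations")
```

Returning the boolean mask alongside the filtered rows makes it easy to apply the same exclusion to any companion columns, such as hospital identifiers.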
After identifying and dropping “outliers” exceeding 1 standard deviation, there was one last tweak to make in the Python implementation. The sklearn K-means function has a parameter called “tol”, which specifies the stopping point, or tolerance, for the clustering algorithm. Through experimentation, the team found that changing “tol” from its default value (1e-4) to 0 allowed the sklearn K-means to match the SAS® proc fastclus output exactly. See the final Python K-means implementation below:
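To illustrate the “tol” adjustment, the sketch below refits KMeans with tol=0 on data that has already had far observations removed. Seeding the second fit with the first run's centers, the five clusters, the 1.0 cutoff, the max_iter value, and the synthetic data are all assumptions made for this example; the team's actual implementation may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))  # synthetic stand-in for the real scores

# First pass: default tolerance, used only to locate provisional centers.
first_pass = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Drop observations farther than the assumed cutoff from their nearest center.
nearest = first_pass.transform(X).min(axis=1)
X_kept = X[nearest <= 1.0]

# Second pass: start from the first-pass centers and set tol=0 so the fit
# stops only when nothing changes between iterations (or max_iter is hit).
final = KMeans(
    n_clusters=5,
    init=first_pass.cluster_centers_,
    n_init=1,
    max_iter=1000,
    tol=0,
).fit(X_kept)

ratings = final.predict(X_kept)  # cluster index for each retained observation
print(final.n_iter_, final.cluster_centers_.ravel())
```

With the default tolerance, the fit may stop while the centers are still shifting slightly, which is one plausible source of small mismatches against another implementation that iterates to full convergence.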
By matching the equivalent parameters between the sklearn K-means and the SAS® proc fastclus, and adding a few lines of Python code to account for the effect of “strict=1”, the team completed a translation from SAS® to Python whose output matched perfectly.
Code conversion is a complex and nuanced endeavor. Being able to leverage experience and documentation, as well as experimentation, goes a long way toward gaining the understanding needed to convert code successfully. Our team was equipped with diverse and complementary skill sets, which fostered the collaboration needed to solve this tricky problem.
We hope you enjoyed reading about our experience in code conversion and that our findings may help guide or inspire your own code conversions. Contact us with questions about code conversions like this one. Our experienced team of experts is able to assess and implement solutions tailored to your business needs.