When you’re first starting out with data science, you will likely have many questions. However, one question that seems to come up quite frequently is: Which language should I choose for data science?
Data scientists generally use either R or Python for the majority of their work, and the two languages are roughly equal in popularity. So, the question effectively becomes: Should I choose R or Python for data science?
Each of these two languages has various pros and cons. In addition, there are often situations where one language is more effective than the other. While I’d love to say there is a one-size-fits-all answer to this question, unfortunately, the only real answer is: It depends.
The more important question, in my opinion, is: When should I choose R vs. Python for data science?
To help you decide which language to choose for each situation, we’ll discuss several scenarios where one language might be more advantageous than the other. This information is based on my own experience over many years using both of these languages for data science.
When it comes to scraping websites, calling APIs, or reading data files, I generally prefer using Python. I started out as a software developer, so I find Python’s interaction with the web, APIs, and file I/O to be very clean and intuitive. However, when it comes to connecting to less common or specialized data sources, R has the widest support I’ve seen in any programming language.
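To give a sense of what I mean by clean and intuitive, here is a minimal sketch of reading tabular data in Python using only the standard library. The sample CSV text, the JSON payload, and the helper names `read_csv_rows` and `read_json_rows` are made up for illustration; in practice the text would come from a file or an API response.

```python
import csv
import io
import json

# Hypothetical sample data standing in for a CSV file and a JSON API response.
CSV_TEXT = "name,score\nalice,90\nbob,85\n"
JSON_TEXT = '{"results": [{"name": "carol", "score": 77}]}'


def read_csv_rows(text):
    """Parse CSV text into a list of dicts, converting the score column to int."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"name": row["name"], "score": int(row["score"])} for row in reader]


def read_json_rows(text):
    """Pull the list of records out of a JSON payload's 'results' key."""
    return json.loads(text)["results"]


# Combine records from both sources into one list of dicts.
rows = read_csv_rows(CSV_TEXT) + read_json_rows(JSON_TEXT)
```

Both sources end up as the same plain list of dictionaries, which is a large part of why I find this kind of work comfortable in Python.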
R has very powerful tools for slicing, dicing, transforming, and cleaning data. So, I generally prefer using R for most data-munging tasks. In addition, because data munging is often estimated to make up as much as 80% of the work in a data science project, R is generally my go-to language for most data science work. However, Python has been rapidly catching up with R in the data-munging space over the past few years.
R was created by statisticians, for statisticians, to perform data analysis. So, R has a significant advantage over Python in this department. Any time the bulk of the work will involve data analysis, I will almost always choose R over Python. In addition, given how much time is spent on exploratory data analysis, once again, R is generally my go-to language for most data science work.
Both R and Python have very powerful data visualization capabilities. They both have tools for creating quick and dirty charts and graphs or more advanced data visualizations. However, I prefer creating data visualizations with R using RStudio because I can interactively develop them in the IDE. Every Python IDE I’ve tried forces me to use pop-up windows.
Both R and Python have excellent support for machine learning. Packages like caret (in R) and scikit-learn (in Python) allow you to automate much of the heavy lifting involved with training, testing, and evaluating machine-learning models. In addition, they both have hundreds of ML algorithms to choose from. However, Python is the clear winner for deep learning with frameworks like TensorFlow.
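As a rough illustration of how little code the train, test, and evaluate cycle can take, here is a minimal sketch using scikit-learn. The choice of the bundled iris dataset, logistic regression, a 25% test split, and `random_state=0` are my assumptions for the example, not a recommendation for any particular problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset (assumed here purely for illustration).
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing; stratify keeps class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train a simple classifier on the training split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out split.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping in a different algorithm usually means changing only the line that constructs the model, which is the kind of heavy lifting these packages take off your hands.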
There are various ways to deploy either R or Python code into production. For quick executable scripts, Python is my first choice. For simple web-based data apps, I use R with Shiny. For reproducible research, I use either R or Python with Jupyter notebooks. For integration into enterprise applications, either language works equally well, since I encapsulate my R or Python code as a microservice, interfaced via a REST API.
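To make the microservice idea concrete, here is a minimal sketch of a prediction endpoint built with Python’s standard library. Everything specific in it is an assumption for illustration: the `/predict` path, the `features` field in the request, and the `predict` function, which is a hypothetical stand-in that just averages its inputs rather than a real trained model. A production service would more likely use a proper web framework.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only the (assumed) /predict route is supported.
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Silence per-request logging for the demo.


def call_predict(port, features):
    """Act as the calling application: POST JSON and decode the JSON reply."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/predict",
        data=json.dumps({"features": features}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read())


def demo():
    """Start the service on a free local port, call it once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # Port 0 picks any free port.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return call_predict(server.server_address[1], [1.0, 2.0, 3.0])
    finally:
        server.shutdown()
        server.server_close()
```

The caller only sees JSON over HTTP, so it neither knows nor cares which language sits behind the endpoint. An R service exposing the same contract would be interchangeable from the enterprise application’s point of view.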
If your project involves a large amount of general-purpose programming, in addition to data science work, Python is clearly the superior choice. However, once again, my recommendation is to encapsulate data science code in its own self-contained microservice. This means that you can call this code effortlessly using any general-purpose language like C# or Java. So, in this case, choosing R vs. Python essentially makes no difference.
Finally, if you’re just getting started with learning data science, my recommendation depends on your background. If you’re already comfortable with C-like programming languages, Python will feel very natural to you while you’re learning data science concepts. However, R grew out of academia, where it was used in part for teaching statistics to students, so it does a much better job, in my opinion, as a learning tool for these data science concepts.
The key takeaway here is that there is no one perfect language for data science. Each of these languages has various pros and cons. Python might make the most sense in one scenario, while R might make more sense in another scenario.
What is most important is that you learn both languages and their pros and cons. Then choose whichever tool is best for the job at hand.