Rambling on Data Science Tools. My 5 Cents
February 28th, 2020
Updated: October 31st, 2020
If you are an aspiring data scientist, one of the first things you'll need to figure out is which tools or programming languages to learn. In this post, I'll cover the most common tools used in data science, but before I list them, I'd like to clarify my definition of data science.
There is obviously machine learning, which excites a lot of people, but machine learning is not the only part of data science. You can't build ML models without preparing your data first.
Data preparation involves accessing data from different sources, figuring out how to join it all together, filling in the blanks and sometimes fixing data inaccuracies. I'm sure many data scientists would agree that data preparation is the most time-consuming part of most data science projects.
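To make the preparation steps concrete, here is a minimal sketch in plain Python. The two "sources" (`customers` and `orders`), their fields, and the choice to fill a missing age with the mean are all made up for illustration; on a real project you would more likely use pandas or SQL for this.

```python
# Hypothetical data from two sources, keyed by customer id.
customers = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "name": "Bob", "age": None},  # a blank that needs filling
    {"id": 3, "name": "Cy", "age": 51},
]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.5},
    {"customer_id": 3, "amount": 12.0},
]


def prepare(customers, orders):
    """Join order totals onto customers and fill missing ages with the mean age."""
    known_ages = [c["age"] for c in customers if c["age"] is not None]
    mean_age = sum(known_ages) / len(known_ages)

    # Aggregate the second source: total order amount per customer.
    totals = {}
    for order in orders:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]

    prepared = []
    for c in customers:
        prepared.append({
            "id": c["id"],
            "name": c["name"],
            "age": c["age"] if c["age"] is not None else mean_age,
            # Customers with no orders get a total of 0.0.
            "total_spent": totals.get(c["id"], 0.0),
        })
    return prepared


rows = prepare(customers, orders)
for row in rows:
    print(row)
```

Even in a toy example like this, most of the code deals with joining and gaps rather than modeling, which matches how real projects tend to go.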
Data science also involves model deployment. While you can delegate model deployment or data prep to someone else, I strongly believe that it's good practice for a data scientist to be involved in every stage, from data preparation and model creation to model deployment. Some companies may have separate data analysts, data engineers, machine learning engineers and data scientists, but in others you may need to wear a variety of hats and possess all these skills.
Data Science Tools
There is a variety of data science tools. Some of them are open source, while others may cost a fortune. Some industries are known for using certain software. If you are new to data science, figuring out where to start can be confusing. Should you learn R? Is Python sufficient? What should you use for data preparation? Below is an overview of a few tools I've worked with, my opinion on how important they are and how I learned them.
Python has become one of the most popular data science tools, although it didn't start as one. Python is a powerful high-level programming language that has been around for about three decades. It can be used for back-end development, web applications, scientific computing, plotting and of course data science. While there are a lot of data science libraries (pandas, matplotlib, scikit-learn, etc.), I strongly encourage every aspiring data scientist to have a solid knowledge of programming foundations in Python. Knowing how to create functions and getting comfortable with native data objects (lists, dictionaries, tuples), conditional logic, loops and debugging is the bare minimum. While it sounds basic, you would be surprised how many people use Python for data science and don't really know these concepts well.
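As a rough self-check on those foundations, here is a small sketch that uses a function, a dictionary, a tuple, a list, a loop and conditional logic together. The function name, the score thresholds and the sample data are invented for illustration.

```python
def summarize_scores(scores):
    """Return a tuple (count, average, labels) for a dict mapping name -> score."""
    labels = {}
    total = 0
    for name, score in scores.items():  # loop over dictionary items
        total += score
        if score >= 90:  # conditional logic
            labels[name] = "high"
        elif score >= 70:
            labels[name] = "medium"
        else:
            labels[name] = "low"
    # Guard against dividing by zero when the dictionary is empty.
    average = total / len(scores) if scores else 0.0
    return len(scores), average, labels


scores = {"ann": 95, "bob": 72, "cy": 58}
count, average, labels = summarize_scores(scores)  # tuple unpacking

# List comprehension: names whose label is anything other than "low".
passed = [name for name, label in labels.items() if label != "low"]
print(count, average, labels, passed)
```

If you can read this without looking anything up, and could step through it in a debugger, you have the kind of baseline I mean.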
How do you learn Python? I started with a few online courses that were helpful but also confusing. What really helped me was getting a basic understanding of the syntax and then working on an actual project in Python, during which I googled how to do practically every task I tried to accomplish. My first Python project worked as expected but was verbose and inefficient, and I get super embarrassed every time someone looks at the code. Still, it was a great learning opportunity and has definitely made me a better programmer.
Once I gained some understanding of what I was doing, I took another online course, which made more sense to me after I had spent some time working in Python. The courses I took were on Codecademy and Udacity. While Codecademy covers the basics pretty well, I didn't like using their coding interface. When I was an absolute beginner, it took me some time to figure out how to even install Python. By the way, if you are struggling with Python installation, here is a post for you: Installing Python. I suggest checking out the Coursera U of Michigan Python Programming course and this book: Here.
Overall, I think you should definitely learn Python. It's a versatile programming language that can be used to solve a variety of problems.
While you can't build predictive models in SQL, you really should get comfortable with writing SQL queries. Knowing the select statement, different types of joins, where, case when, group by, union and some basic SQL commands should be a good start. You can check out this post for an overview of basic SQL commands (Intro to SQL). A lot of companies keep their data in relational databases, so knowing how to extract and manipulate data is absolutely important.
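If you want to practice these statements without setting up a database server, one option is Python's built-in `sqlite3` module with an in-memory database. The sketch below is only an illustration: the `policies` and `claims` tables and their contents are made up, and SQL dialects differ slightly between database systems.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE policies (policy_id INTEGER, state TEXT);
    CREATE TABLE claims (policy_id INTEGER, amount REAL);
    INSERT INTO policies VALUES (1, 'TX'), (2, 'TX'), (3, 'CA');
    INSERT INTO claims VALUES (1, 500.0), (1, 1500.0), (3, 200.0);
""")

# select + left join + where + case when + group by in one query:
# claims per state, and how many of them are "large" (>= 1000).
summary_query = """
    SELECT p.state,
           COUNT(c.policy_id) AS n_claims,
           SUM(CASE WHEN c.amount >= 1000 THEN 1 ELSE 0 END) AS n_large
    FROM policies AS p
    LEFT JOIN claims AS c ON c.policy_id = p.policy_id
    WHERE p.state IN ('TX', 'CA')
    GROUP BY p.state
    ORDER BY p.state
"""
summary = cur.execute(summary_query).fetchall()

# union: policy ids that are either in CA or have a large claim (duplicates removed).
union_query = """
    SELECT policy_id FROM policies WHERE state = 'CA'
    UNION
    SELECT policy_id FROM claims WHERE amount >= 1000
    ORDER BY policy_id
"""
flagged = cur.execute(union_query).fetchall()

print(summary)
print(flagged)
conn.close()
```

Note that the left join keeps policy 2 even though it has no claims, which is exactly the kind of behavior worth experimenting with when you learn the different join types.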
I gained SQL knowledge by observing my actuarial colleagues, looking through their scripts and trying to understand them. I also took an online course on Lynda (I'm not sure if Lynda still exists, but there should be tons of information on the Internet).
I'll have a separate post on cloud platforms since this one is getting pretty long. The cloud space is dominated by three giants: Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure, and you should have some exposure to cloud services. Getting comfortable with them is important, at the very least for model deployment.
There is a big debate in the data science community about which is better, R or Python. While Python seems to be getting more popular these days, we should not underestimate R. There are modeling methodologies (typically econometric or statistical models) that are not supported in Python but are supported in R. Not every model we build is a classification model; some have continuous targets that sometimes follow specific distributions and require certain modeling techniques.
Overall, R is still a powerful modeling tool. I wouldn't personally use R for model deployment, but these days it's more feasible to productionize R code. There is the R Shiny interface, which not only allows you to use R models in real time but also lets you build a web interface. AWS SageMaker has R as one of its supported languages, and I believe AWS Lambda also has layers that allow you to run R code.
There are helpful R courses on Coursera. I took R Programming several years ago. There is a paid certificate option, or you can take the course for free, but then you won't have access to its quizzes. There are also other online platforms, such as Udemy and Codecademy. Udemy courses aren't free, but they seem to have major discounts from time to time, when you can purchase a course for about $10. I would suggest starting with Coursera.
In my first role, we used SAS heavily for data preparation, visualization, modeling and deployment. It required learning some syntax, which I found simpler than R or Python. Some of the products are point and click (Enterprise Miner) and allow you to build predictive models without writing code.
I like SAS for data preparation. I think certain tasks are easier to accomplish in SAS than in SQL, and with less code. When it comes to modeling, it supports a limited set of algorithms, and it takes time for SAS to implement new ones, which usually arrive with a new SAS version. Did I mention the cost? SAS is expensive! There are some industries that rely heavily on SAS, so you may find yourself learning and using it at some point. It's not difficult to learn, and SAS typically offers courses and certificates. These are not free, but your company may have educational credits from SAS that can be used toward them.
Julia is definitely getting popular in the data science community. It combines advantages and syntax of both R and Python. It also appears to be computationally more efficient than R or Python. Unfortunately, there are fewer resources for learning Julia, so I would recommend learning it after you get proficient with R or Python.
I do plan on writing Julia posts at some point in the future, so stay tuned!
The first analytics/predictive modeling tool I encountered was SPSS in graduate school. Back then its capabilities were limited, and we hardly did anything aside from descriptive statistics, hypothesis testing and logistic regressions in it.
I also saw a demo of SPSS a few years back, and it looked more promising. That said, I wouldn't recommend focusing on SPSS as your primary data science tool unless your organization requires it. In most cases, Python, R and cloud services can be a good alternative.
If you are taking an online course, make sure you are not just strictly following the course assignments. I encourage you to explore different ways of getting to the desired output. Also, make sure you consistently spend time practicing coding, especially in the beginning while you are still getting comfortable with the syntax and programming concepts. Perhaps you can find a topic that you enjoy, download relevant data and see what you can do with it in the language or tool you are learning.
One last thing about course certificates. Sometimes certificate options provide you with access to course quizzes which may be useful. If you think that a certificate will increase your odds of getting a data science job, unfortunately it probably won't.
What can you do then? What can definitely help is having an up-to-date git repository where you can demonstrate your coding and data science skills. If you have no relevant experience, pick a challenging project that requires a significant amount of data cleaning. Don't just go with the Titanic, Boston Housing, Iris or other common datasets; these are not representative of the data sets that you'd work with. Can you find a data source that would require joining multiple tables together? You can try extracting data via an API (Here is a post on how to make API calls in Python). Get creative with how you visualize data. Once you have a model, can you write deployment code? Even building a Flask application in Python that can be tested locally would be sufficient to prove that you can handle deployment. It's an extra bonus if you can have it hosted in the cloud. If you have a project like this in your repository, it will demonstrate your familiarity with every stage of a data science project!
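To give a feel for what locally testable deployment code can look like, here is a minimal sketch. It is not a Flask application: to keep it dependency-free it uses the plain WSGI interface from the standard library, which is the same interface Flask applications implement. The `predict` function is a hypothetical stand-in for a trained model, and the feature names `x1` and `x2` are made up.

```python
import json
from io import BytesIO


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    return 0.5 * features["x1"] + 2.0 * features["x2"]


def app(environ, start_response):
    """WSGI app: reads JSON features from the request body, returns a JSON prediction."""
    try:
        size = int(environ.get("CONTENT_LENGTH") or 0)
        features = json.loads(environ["wsgi.input"].read(size))
        body = json.dumps({"prediction": predict(features)}).encode("utf-8")
        status = "200 OK"
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON or missing features: report a client error.
        body = json.dumps({"error": str(exc)}).encode("utf-8")
        status = "400 Bad Request"
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def call_locally(payload):
    """Invoke the app in-process, without opening a network port."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": BytesIO(raw),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


print(call_locally({"x1": 2, "x2": 1}))
print(call_locally({"x1": 2}))  # missing feature -> 400 response

# To serve it over HTTP for manual testing (the port is arbitrary):
#   from wsgiref.simple_server import make_server
#   make_server("localhost", 8000, app).serve_forever()
```

In a real project you would load a saved model instead of hard-coding `predict`, and a framework such as Flask would handle the request parsing for you. The point of the sketch is the shape of the work: accept input, validate it, call the model, and return a response you can test.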
I hope this was somewhat useful. Keep learning!