Installing Python and conda#
One of the major learning goals of this class is for you to be comfortable managing all the software and settings required for you to do data science on your own computer.
Why deal with all the headaches of setting up your own environment, you may ask? Why not just use a cloud platform like Google Colab or a virtual machine with everything already set up?
Getting data science tools installed and working together is, for better or worse, a pretty core part of the day-to-day life of data scientists, and learning how to troubleshoot problems quickly is an important skill for being productive in the profession. But it is a skill that takes time and energy to learn, and so in most classes — which want to focus on teaching topics like statistical analysis or programming concepts — instructors choose to provide students with clean, ready-to-use environments so everyone can focus on those topics. For example, if the MIDS Python Bootcamp included a module on setting up Python environments instead of providing you with a clean virtual machine, you’d probably end up learning ~25% less programming!
But the problem with this approach is that if every course you take pursues this strategy, you may find that you don’t feel empowered to go do data science yourself when those clean VMs are taken away at the end of the semester. Moreover, it means you may not know enough about how data science tools work to debug problems on your own when they come up.
So in this course, we’re going to address environment setup head-on. That will probably mean you’ll get a little annoyed at the fragility of many of these tools, and you may get frustrated spending hours trying to find a setting that got set wrong (though we’ll try to minimize these experiences!), but try to think of this time not as wasted, but instead as part of your data science education!
What We’ll Be Setting Up#
To set ourselves up for this course (and hopefully our careers!), we’ll need to set up the following things:
Python and the conda package manager: This is a Python-centric course, so the first thing we’ll need to do is install Python and a robust, data-science-appropriate package manager.
Visual Studio Code (VS Code): There are a lot of opinions (most strongly held) about what editor is “best.” My own view is that what editor is best depends entirely on not just the kind of work you do and your own working style, but also what the people around you use (nothing better than being able to ask ther person next to you for help!). But even more importantly, I think everyone who works in data science would agree that more important than picking the “correct” editor is becoming proficient in whatever editor you use. With that in mind, most Duke MIDS courses have decided to coordinate around VS Code to allow you the opportunity to get really, really good at VS Code. You may later decide it isn’t for you, but at least this way you’ll have a good sense of what a good editor can do.
Augmented Command Line: As a data scientist, you’ll spend a lot of time working at the command line, so it’s a good idea to invest a little in setting up something more advanced than the default command line tool offered by your operating system (e.g., Terminal/CMD Prompt/Powershell). In addition, this will give us a chance to learn a little about how the command line works, which will be really important to effective troubleshooting.
But that’s a lot, so let’s take things one step at a time! First, let’s install PYTHON!
Reading These Setup Instructions#
As you work through these set up readings, be certain to follow the directions very carefully! As a data scientist you are working at the frontier of software, which often means that there are little quirks and issues with the tools that we use that are just waiting to trap you. Every note you find in these readings I put there because either I ran into a problem or one of my students has, so please take your time and try to be very methodical!
Installing Python with Miniforge#
The first thing you’ll likely want to do on any computer you work with is install both Python and the package manager conda. This is necessary because unlike a language like R where you can install packages with the install.packages()
command, Python doesn’t have an internal tool for installing packages. This means that we need a tool like conda if we want to use anything other than vanilla Python (e.g., tools for plotting, numpy, pandas, etc.).
Python has two main package managers: pip
and conda
. While most software engineers use pip, most data scientists like conda. That’s because while pip is good at installing Python libraries, conda is better at installing many of the big dependencies that underlie data science tools. Plus, if we install conda, it will come with pip, so we get the best of both worlds!
Why Miniforge?#
So the first thing we need to do to get started with Python is go to the Miniforge download page and download the most recent installer for our system (that’s Conda 24.5.0 and Python 3.12.4 as of August 2024). Clicking the appropriate link on this page should cause a file to be downloaded with a name like Miniforge3-MacOSX-arm64.sh
. Don’t open that yet.
Note that there are actually two well-known ways to get conda on your system — installing Anaconda from anaconda.com, installing Miniconda from docs.conda.io, or installing miniforge. It is my strong recommendation that you use miniforge. That’s because if you install Anaconda from anaconda.com, you get not only Python and the conda package manager, but also dozens of pre-loaded packages. And while that sounds great, the reality is that it tends to cause lots of package conflicts once you start adding anything new to your installation.
Miniforge, as the name implies, is the “mini” version of the Anaconda package, and basically only includes Python and a couple core utilities (conda, pip, etc.). As a result, a Miniforge installation is much less likely to cause package conflict problems down the road.
In addition, miniforge (as opposed to miniconda) is set up to download packages from the conda-forge
channel (essentially, a “channel” is a repository with copies of packages). In my experience, this is the best place to get your packages to avoid conflicts.
If you already have a conda installation: Delete it and start fresh. Deleting your Python installation can feel scary once you’ve set stuff up, but you don’t want to get in the practice of being too precious about your Python installations, as you’ll often have to just delete it all to deal with software conflicts.
Thankfully, deleting Anaconda/Miniconda/Miniforge is easy — just delete the miniconda3
/ anaconda3
/ miniforge3
folder you created during installation! The great thing about conda is that everything lives in that folder, so you can easily delete it and start fresh!
An IMPORTANT Note on Pyenv#
miniforge is a SUBSTITUTE for a tool like
pyenv
/venv
that may be suggested in some other courses (like our MIDS NLP class). Do NOT install pyenv / venv and miniforge, just install miniforge. (This is the coordinated recommendation of myself and the Duke MIDS NLP Professor Patrick Wang!)
Miniforge comes with two tools for installing packages that you can use together: conda
and pip
. As we’ll discuss in a later reading, my suggestion is to always try to install things with conda
first (e.g., run conda install numpy
), and if that fails try pip
(e.g., run pip install numpy
).
conda can also manage multiple environments and something called “environments”, so I promise anything pyenv can do conda can do too (and much more!), so install miniforge but not pyenv.
Installation#
Go to the miniforge install page. Click the link for your operating system. Clicking the appropriate link on this page should cause a file to be downloaded with a name like
Miniforge3-MacOSX-arm64.sh
(on a mac) orMiniforge3-Windows-x86_64.exe
(windows).If you’re on a mac:
Move that file that ends in
.sh
to your desktop.Open
Terminal
. It’s inApplications > Utilities
. This may feel scary, but I promise you can do it!Type
cd ~/desktop
and hitEnter
.Type
bash Miniforge3-MacOSX-arm64.sh
(if your file name is different — for example if your computer has an Intel chip, it might be calledMiniforge3-MacOSX-x86_64.sh
— use that file name). HitEnter
. That will launch the installer.When asked to review agreement, hit
Enter
. A new screen will come up with lots of text. At the bottom you’ll see a:
. Just typeq
and it will go away.Type
yes
and hitEnter
when asked if you’ll accept the agreement.Hit enter to accept default installation location.
After installation, you’ll be asked
Do you wish to update your shell profile to automatically initialize conda?
followed by some additional information. Typeyes
and hitEnter
.
If you’re on windows: Run the
.exe
file that was downloaded (probably something likeMiniforge3-Windows-x86_64.exe
).When asked whether to install it for just the current user or all users, select “Just for me (recommended).”
When asked where to install miniforge3, the default (in your user folder) is great.
You’ll then be faced with a handful of check boxes. Check the one to
Add Miniforge 3 to my PATH environment variable
(even if it says it isn’t recommended) and the one toRegister Miniforge3 as my default Python 3.12
. The box for removing the cache is a good one to check too (but isn’t necessary), and you can create a shortcut if you want (but you won’t use it).
And that’s it!#
You now have Python on your system, as well as all the tools you’ll need for managing packages!