How to Label Data for Machine Learning in Python
Before we start: This Python tutorial is a part of our series of Python Package tutorials.
Data labeling in Machine Learning (ML) is the process of assigning labels to subsets of data based on their characteristics. It takes unlabeled datasets and augments each piece of data with informative labels or tags.
Most commonly, data is annotated with a text label. However, there are many use cases for labeling data with other types of labels. Labels provide context for data ranging from images to audio recordings to x-rays, and more.
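As a small illustration of what augmenting data with labels looks like in practice, the sketch below pairs raw items with text labels. The sample sentences, the label names, and the `LabeledExample` class are all made up for this example; real projects usually store labels in whatever format their labeling tool exports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabeledExample:
    """One piece of data plus its optional label (hypothetical structure)."""
    data: str
    label: Optional[str] = None  # None means "not labeled yet"


# An unlabeled dataset: raw items only.
unlabeled = ["The battery lasts all day", "The screen cracked in a week"]

# Labeling attaches an informative tag to each item.
# These tags were chosen by hand purely for the sake of the example.
labeled = [
    LabeledExample(unlabeled[0], "positive"),
    LabeledExample(unlabeled[1], "negative"),
]

for example in labeled:
    print(f"{example.label:>8}: {example.data}")
```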
Data Labeling Procedure
While data has traditionally been labeled manually, the process is slow and resource-intensive. Instead, ML models or algorithms can be used to automatically label data by first training them on a subset of data that has been labeled manually.
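To make the idea of training on a manually labeled subset concrete, here is a minimal sketch that uses a nearest-centroid classifier written in plain Python. The classifier choice, the toy 2-D feature vectors, and the `cat`/`dog` labels are assumptions made for illustration; a real project would more likely use a model from a library such as scikit-learn.

```python
from collections import defaultdict
from math import dist


def train_centroids(points, labels):
    """Compute one centroid (mean point) per label from hand-labeled data."""
    grouped = defaultdict(list)
    for point, label in zip(points, labels):
        grouped[label].append(point)
    return {
        label: tuple(sum(coords) / len(members) for coords in zip(*members))
        for label, members in grouped.items()
    }


def predict(centroids, point):
    """Assign the label whose centroid is closest to the point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


# Step 1: a small subset labeled manually (toy 2-D feature vectors).
manual_points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
manual_labels = ["cat", "cat", "dog", "dog"]

# Step 2: train the labeling model on that subset.
centroids = train_centroids(manual_points, manual_labels)

# Step 3: label the remaining data automatically.
unlabeled_points = [(1.5, 1.2), (8.5, 9.0)]
auto_labels = [predict(centroids, p) for p in unlabeled_points]
print(auto_labels)  # ['cat', 'dog']
```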
One way to automate data labeling is to use a workflow that can identify when the labeling model has higher or lower confidence in its results, and pass the data to humans to do the labeling when lower confidence arises. The new human-generated labels can then be provided back to the labeling model for it to learn from and improve its ability to automatically label the next set of data.
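The routing step of such a workflow can be sketched as follows. This is only an illustration: the `route_by_confidence` function, the 0.8 threshold, and the sample predictions are hypothetical, and a suitable threshold depends on the model and on how costly labeling mistakes are.

```python
def route_by_confidence(predictions, threshold=0.8):
    """Split model predictions into accepted labels and a human review queue.

    `predictions` is a list of (item, label, confidence) tuples, where
    confidence is the model's estimated probability for its chosen label.
    """
    auto_labeled, needs_human = [], []
    for item, label, confidence in predictions:
        if confidence >= threshold:
            auto_labeled.append((item, label))  # trust the model's label
        else:
            needs_human.append(item)  # low confidence: ask a person
    return auto_labeled, needs_human


# Hypothetical model output: (item, predicted label, confidence).
predictions = [
    ("img_001.png", "cat", 0.97),
    ("img_002.png", "dog", 0.55),
    ("img_003.png", "cat", 0.81),
]

auto_labeled, needs_human = route_by_confidence(predictions, threshold=0.8)
print("Accepted automatically:", auto_labeled)
print("Sent to human annotators:", needs_human)
```

The items in `needs_human` would be labeled by people, and those new labels added to the training set before the model is retrained.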
Over time, the model labels more and more data automatically, and the process accelerates. Even so, data labeling remains a slow and repetitive task, so various tools have been developed to streamline it.
How to Use Label Studio to Automatically Label Data
One automated labeling tool is Label Studio, an open source Python tool that lets you label various data types including text, images, audio, videos, and time series.
1. To install Label Studio, open a command window or terminal, and enter either of the following commands:
pip install -U label-studio
python -m pip install -U label-studio
2. To create a labeling project, run the following command:
label-studio init <project_name>
Once the project has been created, you will receive a message stating:
Label Studio has been successfully initialized. Check project states in .\<project_name>
Start the server: label-studio start .\<project_name>
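3. To start the project, run one of the following commands. The first uses a Windows-style relative path; the second passes the project name directly:
label-studio start .\<project_name>
label-studio start <project_name>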
The project will automatically load in your web browser at
4. Click on the Import button to import your data from various sources.
Once the data is imported, you can scroll down the page and preview it.
5. In the menu, click on Settings to continue.
You can now choose among the many options to finish setup for your specific project.
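For the import in step 4, it can help to prepare a file of tasks ahead of time. The sketch below writes a small JSON file using only the Python standard library. It assumes the JSON task layout described in Label Studio's documentation, in which each task is an object whose `data` field holds the item to label; the `text` field name, the sample sentences, and the `tasks.json` file name are hypothetical, and the field name has to match the one used in your project's labeling setup. Check the documentation for your installed version before relying on this layout.

```python
import json
from pathlib import Path

# Hypothetical raw texts to be labeled.
texts = [
    "The battery lasts all day",
    "The screen cracked in a week",
]

# One task per item. The "text" key is an assumed field name; it has to
# match whatever field your project's labeling configuration refers to.
tasks = [{"data": {"text": text}} for text in texts]

# Hypothetical output file, to be selected later from the Import dialog.
output_path = Path("tasks.json")
output_path.write_text(json.dumps(tasks, indent=2), encoding="utf-8")

print(f"Wrote {len(tasks)} tasks to {output_path}")
```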
Get a version of Python, pre-compiled with Scikit-learn and other popular ML Packages
ActiveState Python is the trusted Python distribution for Windows, Linux and Mac, pre-bundled with top Python packages for machine learning – free for development use.
Some Popular ML Packages You Get Pre-compiled – With ActiveState Python
- TensorFlow (deep learning with neural networks)
- scikit-learn (machine learning algorithms)
- Keras (high-level neural networks API)
- pandas (data analysis)
- NumPy (multidimensional arrays)
- SciPy (algorithms to use with NumPy)
- HDF5 (store & manipulate data)
- matplotlib (data visualization)
Why use ActiveState Python instead of open source Python?
While the open source distribution of Python may be satisfactory for an individual, it doesn’t always meet the support, security, or platform requirements of large organizations.
This is why organizations choose ActiveState Python for their data science, big data processing and statistical analysis needs.
Pre-bundled with the most important packages data scientists need, ActiveState Python is pre-compiled so you and your team don’t have to waste time configuring the open source distribution. You can focus on what’s important: spending more time building algorithms and predictive models against your big data sources, and less time on system configuration.
ActiveState Python is 100% compatible with the open source Python distribution and provides the security and commercial support that your organization requires.
With ActiveState Python you can explore and manipulate data, run statistical analysis, and deliver visualizations to share insights with your business users and executives sooner, no matter where your data lives.