25 Best Free Datasets for Machine Learning

Last updated on Apr 05, 2024

Disha works as a technical content writer for Edureka.

Datasets are an integral part of machine learning and NLP (Natural Language Processing). Without training datasets, machine-learning algorithms would have no way to learn text mining, text classification, or how to categorize products. Five to ten years ago, it was very difficult to find datasets for machine learning and data science projects. Today there are so many lists of datasets that the problem is no longer finding one, but sifting through them to keep the relevant ones. So, in this article, we have curated a list of free datasets for machine learning.

Datasets for General Machine Learning

In this context, “general” refers to regression, classification, and clustering with relational data.

Wine Quality  – Properties of red and white vinho verde wine samples from the north of Portugal. The goal here is to model wine quality based on some physicochemical tests. 

Credit Card Default  – Predicting credit card default is a valuable use for machine learning. This dataset includes payment history, demographics, credit, and default data.

US Census Data  – Clustering based on demographics is a tried and tested way to perform market research as well as segmentation.
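As a rough illustration of getting started with a tabular dataset such as Wine Quality, here is a minimal sketch that uses only the Python standard library. It assumes the file is semicolon-delimited with a header row and a `quality` column, as the commonly distributed copies of this dataset are. The sample rows and the function name `mean_quality` are made up for illustration.

```python
import csv
import io

# Two made-up rows in the assumed layout (semicolon-delimited, header row).
SAMPLE = '''"fixed acidity";"alcohol";"quality"
7.4;9.4;5
7.8;9.8;6
'''

def mean_quality(csv_text, delimiter=";"):
    """Return the average of the 'quality' column in the given CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    scores = [int(row["quality"]) for row in reader]
    if not scores:
        raise ValueError("no rows found")
    return sum(scores) / len(scores)

print(mean_quality(SAMPLE))
```

With a real download, the same function could be fed the contents of the file instead of `SAMPLE`. A quick summary like this is a useful sanity check before fitting any regression or classification model.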

Datasets for Natural Language Processing

NLP is all about text data. And for data like text, it’s important for the datasets to have real-world applications so that sanity checks can be performed easily.

Enron Dataset – Email data from the senior management of Enron that is organized into folders.

Amazon Reviews – Contains approximately 35 million reviews from Amazon spanning 18 years. The data includes user information, product information, ratings, and review text.

Newsgroup Classification – Collection of almost 20,000 newsgroup documents, partitioned evenly across 20 newsgroups. It is great for practicing topic modeling and text classification.


Finance & Economics Datasets for Machine Learning

Financial records have been kept in quantitative form for decades, which makes this industry well suited to machine learning.

Quandl: A great source of economic and financial data that is useful to build models to predict stock prices or economic indicators.

World Bank Open Data: Covers population demographics and a large number of economic and development indicators across the world.

IMF Data: The International Monetary Fund (IMF) publishes data on international finances, foreign exchange reserves, debt rates, commodity prices, and investments.

Image Datasets for Computer Vision

Image datasets are useful to train a wide range of computer vision applications, like medical imaging technology, face recognition, and autonomous vehicles.

ImageNet: This de facto standard image dataset for new algorithms is organized according to the WordNet hierarchy, where each node is depicted by hundreds and thousands of images.

Google’s Open Images: A collection of around 9 million URLs to images annotated with labels spanning over 6,000 categories under Creative Commons.

Indoor Scene Recognition: A specific dataset that contains 67 indoor categories and a total of 15,620 images.

Sentiment Analysis Datasets for Machine Learning

Multidomain sentiment analysis dataset – Features product reviews from Amazon.

IMDB Reviews – Dataset for binary sentiment classification. It features 25,000 movie reviews for training and another 25,000 for testing.

Sentiment140 – Uses 1.6 million tweets with emoticons pre-removed.
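Sentiment datasets usually ship as labeled text, so the first step is turning raw rows into (label, text) pairs. Below is a minimal sketch for a Sentiment140-style CSV. It assumes a header-less, six-column layout (polarity, tweet id, date, query, user, text) with polarity coded 0 = negative, 2 = neutral, 4 = positive. Treat that layout as an assumption and check the dataset's own documentation. The sample rows and the name `load_labeled_tweets` are made up for illustration.

```python
import csv
import io

# Assumed polarity coding; verify against the dataset's documentation.
POLARITY_NAMES = {"0": "negative", "2": "neutral", "4": "positive"}

# Made-up rows in the assumed six-column layout (no header row).
SAMPLE = (
    '"0","1","Mon Apr 06 2009","NO_QUERY","user_a","my flight got cancelled"\n'
    '"4","2","Mon Apr 06 2009","NO_QUERY","user_b","loving the new album"\n'
)

def load_labeled_tweets(csv_text):
    """Return (label, text) pairs from CSV text in the assumed layout."""
    pairs = []
    for row in csv.reader(io.StringIO(csv_text)):
        polarity, text = row[0], row[5]
        pairs.append((POLARITY_NAMES[polarity], text))
    return pairs

for label, text in load_labeled_tweets(SAMPLE):
    print(label, "|", text)
```

Mapping the numeric codes to readable names early makes it easier to spot mislabeled or unexpected rows before training a classifier.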

Datasets for Deep Learning

MNIST – Contains images for handwritten digit classification. It is considered a good entry dataset for deep learning as it is complex enough to warrant neural networks while being manageable on a single CPU. 

CIFAR-10 – Contains 60,000 images broken into 10 different classes.

YouTube 8M – Contains millions of YouTube video IDs and billions of audio and visual features pre-extracted by the latest deep learning models.
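One practical detail when starting with MNIST: the original files are distributed in a simple binary format (IDX) rather than as ordinary image files. The sketch below shows one way to parse the image file layout with the standard library, assuming the usual header of four big-endian 32-bit integers (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. The function name `parse_idx_images` and the tiny fake file are made up so the example runs without downloading anything.

```python
import struct

def parse_idx_images(data):
    """Parse IDX image bytes into a list of images (each a list of pixel rows)."""
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number: {magic}")
    pixels = data[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel data length does not match header")
    images = []
    for i in range(count):
        start = i * rows * cols
        image = [
            list(pixels[start + r * cols : start + (r + 1) * cols])
            for r in range(rows)
        ]
        images.append(image)
    return images

# A made-up 2x2 "image" in the assumed layout.
fake_file = struct.pack(">IIII", 2051, 1, 2, 2) + bytes([0, 255, 128, 64])
print(parse_idx_images(fake_file))
```

The real files are typically gzip-compressed, so in practice the bytes would likely come from something like `gzip.open(path, "rb").read()`. Most deep learning libraries also offer their own MNIST loaders, which are usually the more convenient route.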

Public Government Datasets for Machine Learning

Machine learning models trained using public government data help policymakers to identify trends and prepare for issues related to population growth, aging, and migration.

Food Environment Atlas – Contains data for local food choices that affect diet in the US.

Chronic disease data – Contains data on chronic disease indicators across the US.

The US National Center for Education Statistics – Data on educational institutions and education demographics from the US and around the world.

Datasets for Autonomous Vehicles

Autonomous vehicles need to be trained on large amounts of high-quality data so that they can perceive their environment and surrounding objects accurately.

Berkeley DeepDrive BDD100k – One of the largest datasets for self-driving AI. It contains around 100,000 videos covering more than 1,100 hours of driving at different times of day and in different weather conditions.

Baidu Apolloscapes – Defines 26 different semantic items, such as cars, cycles, pedestrians, and buildings.

Oxford’s Robotic Car – Over 100 repetitions of the same route through Oxford, UK, captured over a year. The dataset captures different combinations of traffic, weather, and pedestrians, along with changes like construction and roadworks.

KUL Belgium Traffic Sign Dataset – Contains more than 10,000 traffic sign annotations from thousands of traffic signs in the Flanders region of Belgium.

