Machine Learning Datasets: A Selection of Datasets for Machine Learning

This article is a guide to open datasets for machine learning. To start, I have collected a selection of interesting and relatively fresh datasets. As a bonus, at the end of the article I have attached useful links for finding datasets on your own.
Fewer words, more data.

A selection of datasets for machine learning:

Game of Thrones deaths and battles: this dataset combines three data sources, each based on information from the book series.
Global Terrorism Database: over 180,000 terrorist attacks worldwide, 1970-2017.
Bitcoin historical data: Bitcoin data at 1-minute intervals from selected exchanges, January 2012 to March 2019.
FIFA 19 complete player dataset: 18k+ FIFA 19 players, ~90 attributes, extracted from the latest FIFA database.
YouTube video statistics: daily statistics for trending videos on YouTube.
Suicide rates overview, 1985 to 2016: a comparison of socio-economic information with suicide rates by year and country.
Huge stock market dataset: historical daily prices and volumes of all US stocks and ETFs.
World Development Indicators: development indicators for countries around the world.
Kaggle Machine Learning & Data Science Survey 2017: a broad view of the state of data science and machine learning.
Gun violence data: a comprehensive record of more than 260,000 US gun violence incidents, 2013-2018.
Chest X-ray (pneumonia): 5,863 images, 2 categories.
Gender recognition by voice: this database was created to identify a voice as male or female based on the acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples collected from male and female speakers.
Student alcohol consumption: the data was obtained from a survey of secondary-school students in mathematics and Portuguese language courses. It contains a lot of interesting social, gender and educational information about the students.
Malaria Cell Dataset: cell images for detecting malaria.
Young people survey: data on the preferences, interests, habits, opinions and fears of young people.
World University Rankings: explore the best universities in the world.
Credit Card Fraud Detection: anonymized credit card transactions labeled as fraudulent or genuine.
Heart disease dataset: this database contains 76 attributes, such as age, sex, chest pain type, resting blood pressure and others.
European Soccer Database: 25,000+ matches, with player and team attributes for European professional football.
Wine Reviews: 130k wine reviews with variety, location, winery, price and description.
Baidu Apolloscapes: a large dataset for recognizing 26 semantically different objects such as cars, bicycles, pedestrians, buildings and street lamps.
Comma.ai: more than seven hours of highway driving. The dataset includes the vehicle's speed, acceleration, steering angle and GPS coordinates.
Flowers recognition: this dataset contains 4,242 images of flowers. The data was collected from Flickr, Google Images and Yandex Images.
Daily market price of every cryptocurrency: historical cryptocurrency prices for all tokens.
Chocolate ratings: expert ratings of more than 1,700 chocolate bars.
Health insurance marketplace: data on health and dental plans offered on the US health insurance marketplace.
Heartbeat sounds: classification of heartbeat anomalies from stethoscope audio.
Anime Recommendations Database: recommendation data from 76,000 users of myanimelist.net.
Blood cell images: 12,500 images of 4 different cell types.
Chest X-ray: over 112,000 chest radiographs from more than 30,000 unique patients.
Homicide reports, 1980-2014: the Murder Accountability Project is the most comprehensive homicide database currently available in the United States.
Used car database: over 370,000 used cars. The data content is in German, so you will need to translate it first if you do not read German.
US Government open data portal: data, tools and resources for conducting research, developing web and mobile applications, and designing data visualizations.
National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP): the center works to reduce the risk factors for chronic diseases.
The UK's largest collection of social, economic and demographic resources.
EconData: several thousand economic time series, produced by a number of US government agencies and distributed in a variety of formats and media.
Coast Research Center: interesting data on the sea and its biological composition. Here you can find datasets ranging from the analysis of Red Sea model data to the study of temperature and currents over the narrow Southern California shelf.
Sign Language Digits Dataset: a sign language digits dataset from Ayrancı Anadolu High School in Ankara, Turkey.
Red wine quality: a simple, clean practice dataset for regression or classification modeling.
English football Premier League tables (1968-2019).
HotpotQA dataset: a question-and-answer dataset that lets you build question-answering systems whose answers are easier to explain.
xView: one of the largest publicly available datasets of overhead imagery of the earth. It contains images of various scenes from around the world, annotated with bounding boxes.
Labelme: a large dataset of annotated images.
ImageNet: an image dataset for new algorithms, organized according to the WordNet hierarchy, in which hundreds to thousands of images represent each node of the hierarchy.
LSUN: image datasets divided into scenes and categories, with partially labeled data.
MS COCO: a large-scale dataset for object detection and segmentation.
COIL100: 100 different objects imaged at every angle of a full rotation.
Visual Genome: a dataset of ~100 thousand images with detailed annotations.
Google's Open Images: a collection of 9 million URLs to images "tagged with more than 6,000 categories" under a Creative Commons license.
Labelled Faces in the Wild: a set of 13,000 labeled face images for developing applications that involve face recognition technology.
Stanford Dogs Dataset: 20,580 images of 120 dog breeds.
Indoor Scene Recognition: a dataset for recognizing building interiors. It contains 15,620 images in 67 categories.
Oxford's Robotic Car: more than 100 repetitions of a single route through Oxford, recorded over the course of a year. The dataset captures various combinations of weather, traffic and pedestrians, as well as longer-term changes such as road works.
Cityscape Dataset: a large dataset containing recordings of street scenes in 50 cities.
KUL Belgium Traffic Sign Dataset: over 10,000 annotations of thousands of different traffic signs in Belgium.
LISA Laboratory for Intelligent & Safe Automobiles: a dataset with road signs, traffic lights, detected vehicles and movement trajectories.
Bosch Small Traffic Light Dataset: a dataset with 24,000 annotated traffic lights.
WPI datasets: datasets for recognizing traffic lights, pedestrians and road markings.
Berkeley DeepDrive: a huge dataset for self-driving systems. It contains over 100,000 videos with more than 1,100 hours of driving recorded at different times of day and in different weather conditions.
MIMIC-III: de-identified data on the health of ~40,000 intensive care patients (demographic data, vital signs, laboratory tests and medications).
Amazon Reviews: about 35 million Amazon reviews spanning 18 years. The data includes product and user information, ratings and the text of each review.
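
As a rough illustration of a first look at one of the tabular datasets above (for example the credit card fraud data, where every transaction carries a fraud label), the sketch below counts how many rows fall under each label. The column name `Class`, the helper `label_counts` and the inline sample rows are assumptions made up for illustration; they are not taken from any of the listed datasets.

```python
import csv
import io
from collections import Counter


def label_counts(csv_text, label_column):
    """Count how many rows carry each value of label_column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)


# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE = """Amount,Class
12.50,0
3.99,0
250.00,1
18.20,0
"""

counts = label_counts(SAMPLE, "Class")
total = sum(counts.values())

# Print the share of each label to spot class imbalance early.
for label, n in sorted(counts.items()):
    print(f"label={label}: {n} rows ({n / total:.0%})")
```

With a real download, the text would come from the file itself rather than an inline string. A heavily skewed count is a hint that plain accuracy may be a misleading metric for that dataset.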

Useful links for finding datasets:

Kaggle, of course: the meeting place for all fans of machine learning competitions.
Google Dataset Search: search for datasets across the internet. You can also add your own datasets if needed.
Machine Learning Repository: a collection of databases, domain theories and data generators used by the machine learning community for the empirical analysis of machine learning algorithms.
VisualData: dataset search for computer vision, conveniently organized by category.
DATA USA: a comprehensive set of publicly available US data with visualizations, descriptions and infographics.

That brings our short selection to an end. If you have something to add or share, write in the comments.
Thanks!

Translated from: https://habr.com/en/post/452740/
