A CRITICAL FIELD GUIDE FOR WORKING WITH MACHINE LEARNING DATASETS
Written by Sarah Ciston {1}
Editors: Mike Ananny {2} and Kate Crawford {3}







Part of the Knowing Machines research project
TABLE OF CONTENTS
1. Introduction to Machine Learning Datasets
2. Benefits: Why Approach Datasets Critically?
3. Parts of a Dataset
4. Types of Datasets
5. Transforming Datasets
6. The Dataset Lifecycle
7. Cautions & Reflections from the Field
8. Conclusion
1. INTRODUCTION TO MACHINE LEARNING DATASETS
Maybe you’re an engineer creating a new machine vision system to track birds. You might be a journalist using social media data to research Costa Rican households. You could be a researcher who stumbled upon your university’s archive of handwritten census cards from 1939. Or a designer creating a chatbot that relies on large language models like GPT-3. Perhaps you’re an artist experimenting with visual style combinations using DALL-E 2. Or maybe you’re an activist with an urgent story that needs telling, and you’re searching for the right dataset to tell it.
WELCOME.
No matter what kind of datasets you’re using or want to use, whether you’re curious but intimidated by machine learning or already comfortable, this work is complicated. Because machine learning relies on datasets, and because datasets are always tangled up in the ways they’re created and used, things can get messy. You may have questions like:
Does this dataset tell the story of my research in the way I want?
How do the dataset pre-processing methods I choose affect my outcomes?
How might this dataset contribute to creating errors or causing harm?
More than likely you will encounter at least some of these conundrums — as many of us who work with machine learning datasets do. Anyone using datasets will weigh choices and make tradeoffs. There are no universal answers and no perfect actions — just a tangle of dataset forms, formats, relationships, behaviors, histories, intentions, and contexts.
When choosing and using machine learning datasets, how do you deal with the issues they bring? How can you navigate the mess thoughtfully and intentionally? Let’s jump in.
1.1 WHAT IS THIS GUIDE?
Machine learning datasets are powerful but unwieldy. They are often far too large to check manually for inaccurate labels, dehumanizing images, or other widespread issues. Although datasets commonly contain problematic material, whether from a technical, legal, or ethical perspective, they are also valuable resources when handled carefully and critically. This guide offers questions, suggestions, strategies, and resources to help people work with existing machine learning datasets at every phase of their lifecycle. Equipped with this understanding, researchers and developers will be better able to avoid the problems unique to datasets. They will also be able to construct more reliable, robust solutions, or even explore promising new ways of thinking with machine learning datasets that are more critical and conscientious. {4}, {5}
If you aren’t sure whether this guide is for you, consider the many places you might find yourself working with machine learning datasets. This guide can be helpful if you are…
- making a model
- working with a pre-trained model
- researching an existing machine learning tool
- teaching with datasets
- creating an index or inventory
- concerned about how datasets describe you or your community