I’ve been mulling over this post about big data in the IT world
for quite a while. It basically says that given large (and growing) data
sets, companies that didn’t previously need data researchers suddenly
need people to help them use “big data.” Every company is a data
company. In effect, we have an ironic counterexample to the idea that
automation reduces the need for labor.
These would-be “data managers” have their work cut out for them. Data
has value, sure, but unlike software, which has value as long as someone
knows how to use it, poorly utilized data is just as good as no
data. Data researchers need to be able to talk to developers and help
figure out what data is worth collecting and what data isn’t.
Organizations need someone to determine what data has real value, if
only to solve a storage-related problem. Perhaps more importantly, data
managers would need to guide data usage both technically (in terms of
algorithms and software design) and in terms of understanding the
abilities and shortfalls of data sets.
There’s a long history of IT specialist positions: database developers,
systems administrators, quality assurance engineers, release
engineers, and software testers. Typically we divide this between
developers and operations folks, but even the development/operations
division is something of a misnomer. There are merits to generalism and
specialization, but as projects grow, specialization makes sense and
data may just be another specialty in a long tradition of software
development and IT organization.
Specialization also makes a lot of sense in the context of data, where
having a lot of unusable data adds no value and can potentially subtract
value from an organization.
A Step Back
There are two very fundamental points that I’ve left undefined: what
“data” am I talking about and what kinds of skills differentiate
“data specialists” from other kinds of technicians.
What are big data?
Big data sets are, to my mind, large collections of data, GIS/map based
information, “crowd sourced” information, and data that is
automatically collected through the course of normal internet activity.
Big data is enabled by increasingly powerful databases and the ubiquity
of computing power, which lets developers process data on large
scales. For example: the aggregate data from Foursquare and other
similar services, comprehensive records of user activity within websites
and applications, service monitoring data and records, audit trails of
activity on shared file systems, transaction data from credit cards and
customers, tracking data from marketing campaigns.
With so much activity online, it’s easier for software developers and
users (which is basically everyone, directly or otherwise) to create and
collect a really large collection of data regarding otherwise trivial
events. Mobile devices and linkable accounts (OpenID, and other single
sign-on systems) simplify this process. The thought and hope is that all
this data equals value, and in many circumstances it does. Sometimes, it
probably just makes things more complicated.
Data Specialists
Obviously every programmer is a kind of “data specialist,” and the last
seven or eight years of the Internet have done everything to make every
programmer a data specialist. What the Internet hasn’t done is give
programmers basic human factors knowledge, or a background in
fundamental quantitative psychology and sociology. Software development
groups need people who know what kinds of questions data can and
cannot answer regardless of what kind or how much data is present.
Data managers would thus hold one of those positions that sits
between/with technical staff and business staff, and perhaps I’m
partial to work in this kind of space, because this is very much my
Chance. But there’s a lot of work in bridging this divide, and a great
deal of value to be realized in this space. And it’s not like there’s
a shortage of really bright people who know a lot about data and social
science who would be a great asset to pretty much any development team.
Big Data Beyond Software Development
The part of this post that I’ve been struggling over for a long time is
the mirror of what I’ve been talking about thus far. In short, do
recent advancements in data processing and storage (NoSQL, MapReduce,
etc.) that have primarily transpired amongst startups, technology
incubators, and other “Industry” sources have the potential to help
academic research? Are there examples of academics using data collected
from the usage habits of websites to draw conclusions about media
interaction, reading habits, cultural participation/formation? If nothing
else, are sociologists keeping up with “new/big data” developments? And
perhaps most importantly, does the prospect of being able to access and
process large and expansive datasets have any effect on the way social
scientists work? Hopefully someone who knows more about this than I do
will offer answers!