46 Comments
Kzak's avatar

Pro move: upload the file with an academic publication, then use the massive citations as social proof when people ask why aren't you in academia.

Lift yourself up by your own thumb.

Aella's avatar

You can upload it to an academic publication??

Austin Wallace's avatar

I would recommend uploading to something like https://zenodo.org!

From their site:

"Why use Zenodo?

Safe — Your research is stored safely for the future in CERN’s Data Centre for as long as CERN exists

Trusted — Built and operated by CERN and OpenAIRE to ensure that everyone can join in Open Science

Citeable — Every upload is assigned a Digital Object Identifier (DOI), to make them citable and trackable

No waiting time — Uploads are made available online as soon as you hit publish, and your DOI is registered within seconds

Open or closed — Share e.g. anonymized clinical trial data with only medical professionals via our restricted access mode

Versioning — Easily update your dataset with our versioning feature

GitHub integration — Easily preserve your GitHub repository in Zenodo

Usage statistics — All uploads display standards compliant usage statistics"

Kzak's avatar

I didn't know about this one. Are the datasets not linked to a journal article in any way?

Anonymous Dude's avatar

More locations, harder to disappear. As many as possible!

User's avatar
Comment deleted
Feb 12
Austin Wallace's avatar

Oh no! Apologies for that

https://data.mendeley.com is another good option. I’ve confirmed that I can sign up and publish a dataset without an .edu email

https://osf.io is another option if the underlying Elsevier name leaves a bad taste in your mouth.

Aella's avatar

sorry, it said I couldn't but i tried anyway and it let me sign up.

I've published the dataset on zenodo, thanks!

Austin Wallace's avatar

Yay! I would love to see this get cited by the broader scientific community for multiple reasons.

Austin Wallace's avatar

I know you know this, but probably worth posting the DOI/link in this post and on X

Earnest & Rose's avatar

This is so great!

Can I suggest that you make the zenodo description sound more professional? I took your description and just ran it through an LLM real quick and told it to make it sound all scholarly and professional (I pasted its output below). A fancier-feeling description will help it be something that researchers are more comfortable citing. Also, can I very strongly recommend that you make up a last name as well? It makes them feel more professional when they can cite your last name. Single name people slightly mess up citations in a way that they don't like.

It's really dumb, but such is the way the scholarly landscape works, and I want to see you get as many citations and as much credit as possible for having the best work out there. Get better ammo to put the naysayers to shame! And these two little tweaks can boost your chances.

"""

The file provided here is a representative subsample of approximately 15,000 respondents drawn from the Big Kink Survey (total n = 970,000), a large and comparatively comprehensive online survey of sexual interests and fetishes that attained viral dissemination. The original respondent pool was substantially demographically skewed (younger, female, liberal, and non-cisgender). The present dataset was constructed as a smaller, comparatively more demographically representative subsample. For reasons of coverage and anonymity, the analytic sample was restricted to respondents aged 32 or younger, as the survey yielded markedly fewer older participants. The dataset is further restricted to respondents located in the United States, Canada, and Europe.

Post-stratification and balancing were conducted using a broader set of demographic variables than those retained in this release; following balancing, multiple demographic columns were removed to reduce re-identification risk. In addition, the dataset has been anonymized through several rounds of similar-row demographic swapping and the introduction of noise across the dataset. As a consequence, bivariate associations and related correlation estimates are expected to be attenuated relative to the original subsample: in general, correlations are approximately 25% weaker than in the pre-noise version, whereas base rates for most items are less materially affected.

This companion folder [LINK] is provided with additional documentation, including survey back-end materials containing the full item wordings and the internal labeling scheme used by the instrument.

Initial data cleaning was intentionally minimal. Exclusions included respondents who reported answering dishonestly, respondents who completed the survey implausibly quickly, extreme response-pattern outliers (e.g., endorsing nearly all items), respondents exhibiting internal inconsistencies (e.g., indicating both high and negligible interest in the same fetish in different sections), and all respondents reporting an age of 69. Additional structural cleaning was performed to address artifacts introduced by the survey platform, principally duplicated columns in which responses were split across parallel variables.

Finally, note that the items pertaining to dirty talk and cunnilingus may be inadvertently reversed in the cleaned dataset; users should verify item mapping against the accompanying instrument documentation before analysis.

"""

Jeremy R Cole's avatar

You could certainly upload an academic publication describing how the data was collected, and link to the data.

Kzak's avatar

Many journals allow datasets to be uploaded along with the study. I am not familiar with social sciences but there should be some for the field.

Even if not, you can still write an article about how you gathered the data and some results, and link to the dataset. Then, wherever you store the dataset, you add a reference to your article and ask people to cite it when using the data (you see this a lot on GitHub for engineering work).

If you don't feel confident, maybe contact some researchers in the field and ask them. I would expect you can easily get it published; any journal that gets it is getting an easy win.

Austin Wallace's avatar

I’m building a website to explore this data, please reply if you would like early access and to give feedback or contribute!

Edit: I'm ready to show this more publicly, and am open to any and all feedback:

https://www.austinwallace.ca/survey

Alliej's avatar

I’d be interested please

Jim Jones's avatar

I'd like to check it out

Tyler Morningstar's avatar

I’d be curious as well.

Anonymous Dude's avatar

First of all, thanks so much for doing this! Huge amount of work and very helpful!

I clicked through to Explore Two Questions and when I went to, say, "Engaging with or fantasizing about what arouses me feels..." as the first question and "Conscientiousness" as the second, I got the "Engaging with or fantasizing about what arouses me feels..." sorted in some weird order that goes 0, 1, 2, 3, -1, -3, -2. Then going to "I am aroused by being dominant in sexual interactions" it goes 1, 2, -2, -1, 0, 3, -3. Looking at the detailed data table, it seems to sort by frequency of response rather than in order as you would expect for ordinal or ratio data like a number.

Also since a lot of these go -3 to 3 with a genuine zero point you might consider putting in a correlation coefficient or a heat map. (EDIT: saw there is a correlation coefficient, it's just small.)

No age categories over 32--is this just the sample?
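The ordering issue described above can be sketched in a few lines of Python. This is only an illustration of the difference between sorting response levels by frequency and sorting them by value; the sample responses are made up, and the function names are hypothetical rather than anything from the explorer site's code.

```python
from collections import Counter


def tabulate_by_frequency(responses):
    """Return (value, count) pairs, most common first.

    This reproduces the kind of ordering that looks odd for a -3..3 scale.
    """
    return Counter(responses).most_common()


def tabulate_ordinal(responses):
    """Return (value, count) pairs sorted by the response value itself."""
    return sorted(Counter(responses).items())


# Made-up responses on a -3..3 scale, purely for illustration.
sample = [0, 0, 0, 1, 1, 2, 3, -1, -3, -2, 0, 1]

by_freq = tabulate_by_frequency(sample)
by_value = tabulate_ordinal(sample)

print([value for value, _ in by_freq])
print([value for value, _ in by_value])
```

For ordinal or numeric scales, the second ordering is usually the one a reader expects in a table or on a chart axis.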

Austin Wallace's avatar

Thank you for the detailed feedback! I appreciate you taking the time; thanks to you I’ve made some updates.

I fixed the ordering bug and added both heatmap coloring and Spearman correlation!

And yes, there are no 32+ in the larger sample data.
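For anyone curious what a Spearman correlation does with -3 to 3 style answers, here is a minimal standard-library sketch. It is not the site's implementation (which isn't shown in this thread); it just ranks both variables, giving tied answers the average of their ranks, and then takes the Pearson correlation of the ranks.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because survey scales like these have many ties, the average-rank step matters; the shortcut formula based on squared rank differences is only exact when there are no ties.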

Anonymous Dude's avatar

Can confirm both! Thank you so much! Wow!

Austin Wallace's avatar

It’s ready for publication! Please take a look; I would love some help getting early visibility: https://x.com/austeane/status/2023970031739035669?s=20

Austin Wallace's avatar

I'm ready to show this more publicly, and am open to any and all feedback!

https://www.austinwallace.ca/survey

And also @aella I do have a couple of questions and have DM'd you

Russell Williams's avatar

My suggestion would be to prominently show Aella's specific caveats about the anonymization: the magnitude of effect on correlations and the Logan's Run-like age truncation. I'd put it on the front page under "About the data" and on the "Open data quality" page, and maybe even add specific cautions about queries that the tool makes easy but which the anonymization significantly affects. I applaud her caution in anonymizing the data, and appreciate your generosity and effort in implementing the explorer tool. And I anticipate a flood of posts making spurious claims.

Austin Wallace's avatar

Thank you for the feedback! I think since I don’t know all of the details of the transformations the best thing to do would be to maybe directly quote Aella’s language from this post. However I am open to suggestions!

Austin Wallace's avatar

I did update my language to mirror more directly Aella’s, if you want to take a look :)

Shoni's avatar

I'd love to see it! Can you send me a link when you have one?

Austin Wallace's avatar

Sent you the link in dm!

RJ's avatar

gave this dataset to Claude Code and got some pretty amazing results and generated polished infographics. what a rich dataset, thank you!

The Decadent Fool's avatar

Thank you for this Aella -- you're doing God's work!! I'm super excited to dig into the data!

I'm a political economist (postdoc at Harvard rn) and I've been playing with some ideas about how power in sex/kink is rooted in or shapes power in society. I have a ton of interesting hypotheses, but I saw that variables like country of origin and race/ethnicity are not in the data (probably for a good reason), which prevents me from trying to get some empirical meat on my ideas.

For example, I've been going through Berezkin's folklore and mythology catalog, and noticed that there are many "kinky" folk motifs, stuff like toothed vaginas or using a penis to cross a bridge. I'd be curious to see how specific aspects of culture embodied by these motifs correlate with kinks that you have recorded, but I'd need to map both surveys to regions, if only broadly. I think there's a lot to this, but there is almost no theorizing or data (until now!).

Is it possible to at some point get access to more variables like country of origin? Happy to talk more about the ideas I'm trying to test or anything else...thanks again!

Aella's avatar

Hey, if you email a more specific proposal that outlines exactly what information you need, i can probably get you a subset for this. Make sure you check the original survey file (in the folder i linked) to read through all of the available questions and how they were worded so you can know which ones you're asking for.

Matthew's avatar

So I took a stab at throwing this into Gemini and it said "This analysis environment lacks the specific library (pyarrow or fastparquet) required to read .parquet files.

Action Required: To proceed with analysis, please convert this file to a .csv format and upload it again." While I can probably figure this out, a CSV file for download may make it easier for some people.

Aella's avatar

Ah I assumed that the ais would just solve it. I'll upload a csv later today

Austin Wallace's avatar

A computer-access AI like Claude Code or Codex will have a better time with Parquet, but web-based AI might not depending on which one it is.

I've converted to CSV and uploaded here @Matthew

https://drive.google.com/file/d/1uuNZFRUi4fVFKxChbudmRTup8ZQv7roT/view?usp=sharing

Aella's avatar

Ty! I also have updated the link in the post to go to a folder which contains both csv and parquet.

Samuel Gild's avatar

Wow I would love to see some analysis of this data, or visualisation. Fascinated already.

Nazar Androshchuk's avatar

Search for Aella’s own analysis from a few years back

Azure's avatar

I'm interested in looking at the relationship between some of the spanking-related questions. A couple things I noticed:

-It looks like p957nyk (How often were you spanked as a form of discipline) was collapsed from 5 levels to 3 in BKSPublic.csv, but I don't see it mentioned in the column modification notes. I would be interested in seeing how these were binned.

-There is no column for "spankpainlevelchildhood" in BKSPublic.csv

-There is no column for "spankedfrequency" in BKSPublic.csv.

Anonymous Dude's avatar

Hey Aella and (Only)fans, I have a conspiracy theory I'd like to see y'all weigh in on.

The divide between anti-sex and pro-sex feminists is at least in part the divide between tendersexual and bdsmsexual feminists.

Let me know what you think!

Vincent's avatar

I hope no one is really into hotwifing. Else that would be wrong.

Jason Brinkley's avatar

What a neat dataset for stats nerds to play with. I opened the files and started doing some additional cleaning and organization as well. I think a few things can help speed things along if you want to make some additional effort:

1. You can streamline the files by numbering questions and then including a version of the data with the variables ordered numerically as they occur in the survey. Can be useful for things.

2. Someone can make you a data dictionary for the data that has basic counts and relative frequencies. That way numbers nerds who don't work with data can at least look at the marginal results for the entire survey.

3. After organizing the main file, I would also reorganize a version with the open text variables at either the front or the back end. Basically, I would dump all the variables that have more than 10-12 categories at the end cause they are going to make any modeling a challenge. There is so much here to explore just in the categorical and quant data that you don't need much else.

4. Instead of a formal journal article, you can start an article on arXiv.org with select results and then upload the data to that working paper. In fact, anyone here that does an interesting analysis with the data can put their results on arXiv with an ok from you, or just reference this blog.

5. I started playing with the biological male data and it is really neat stuff. I found really cool associations between biomale and

'In general, I prefer when the person in bondage is' - Someone Else (80% men), Me (25% men)

Clothing score - associations go in polar opposite directions for men versus women

Which describes you best? (cvc5b81) - rates go in opposite directions between men and women

6. Finally, be careful all who work with the age and count variables. They will tend to confound with one another. People who are older typically have more experiences so you get higher numbers just from that. This is exactly the kind of place where we might think of 'age adjusted' health analyses. Typical statistical methods do not always fully adjust for these kinds of associations, it isn't like you can just run a regression with age and use the residual results. Not when the associations run this deep.
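As a rough sketch of suggestions 2 and 3 above, a data dictionary with counts and relative frequencies can be built with the Python standard library alone. The column names and rows below are invented for illustration, the function name is hypothetical, and the cutoff of 12 levels simply borrows the "more than 10-12 categories" rule of thumb from point 3.

```python
import csv
import io
from collections import Counter


def data_dictionary(rows, max_levels=12):
    """Summarize each column of an iterable of dict rows (e.g. csv.DictReader).

    Values are assumed to be strings, as read from a CSV file. Empty cells
    count as missing. Columns with more than max_levels distinct values
    (open text, IDs) get only a summary, not a per-level breakdown.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            if value not in ("", None):
                columns.setdefault(name, Counter())[value] += 1

    summary = {}
    for name, counts in columns.items():
        total = sum(counts.values())
        entry = {"n": total, "distinct": len(counts)}
        if len(counts) <= max_levels:
            # level -> (count, share of non-missing answers)
            entry["levels"] = {
                level: (count, count / total)
                for level, count in sorted(counts.items())
            }
        summary[name] = entry
    return summary


# Tiny made-up table; these are not columns from the real dataset.
demo_csv = io.StringIO("role,score\nme,1\nother,2\nother,\nme,1\n")
demo = data_dictionary(csv.DictReader(demo_csv))
print(demo)
```

To run it on a real file, pass `csv.DictReader(open(path, newline=""))` instead of the in-memory demo. Note that a column with no non-missing values will not appear in the result at all.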

Matthew's avatar

Thanks for posting. I've been playing with it in Gemini once I got it loaded. I made some "research" papers from it, given that I share the sarcasm toward the academic research community that you expressed in https://aella.substack.com/p/me-vs-the-entire-field-of-fetish. One of the better ones is here: https://1drv.ms/b/c/1dbe5e06885602de/IQCNU8TIT9ztSI7HHDz_dF2CAcEnJWZzQDOyZfMLJJjw3wM?e=jgjqss I am impressed with AI, but it clearly has limits. I question how accurate Weibull distributions fitted to binned data can truly be. Thanks for giving me a dataset to learn about AI with.

Kasia Zaniewska's avatar

"I did limit the sample from the very beginning to be ages 14-32, and responses from western countries only (US/Canada and Europe)"

Could you share more context on that choice, or point me to where I can find it if it was previously posted?

Aella's avatar

basically, i just had way fewer older responses, and there wasn't a clean/simple way to downsample correctly while maintaining anonymity for older ppl, so it was a tradeoff on age bracket and sample size. western countries is mostly for balancing representativeness; i feel more confident about general political distributions in western countries than i do in like india for example.

Fujimura's avatar

An alternative way to approach representativeness, if it would be more convenient in the future, would be not to directly downsample or upsample respondents in the data, but just to include survey weights along with the data. Then you wouldn't need to worry about constructing a dataset that is itself representative out of a dataset with very few older respondents; you could just include more 'raw' data, with survey weights reflecting the under-representation of older people (i.e. the weights column will have higher numbers for the under-represented older people).

This is one example (https://rpubs.com/DACSS_Prof/1244160), but your LLM of choice would also be able to do it.
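To make the idea concrete, here is a minimal sketch of one simple kind of survey weight (post-stratification on a single variable). The age groups, answers, and population shares are all made up, the function names are hypothetical, and this is not a description of how any Big Kink Survey weights were actually built.

```python
from collections import Counter


def poststratification_weights(groups, population_share):
    """Give each respondent the weight: population share / sample share of their group.

    Under-represented groups end up with weights above 1.
    """
    n = len(groups)
    sample_share = {g: count / n for g, count in Counter(groups).items()}
    return [population_share[g] / sample_share[g] for g in groups]


def weighted_mean(values, weights):
    """Mean of values where each respondent counts in proportion to their weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Made-up sample: 8 younger and 2 older respondents, with a yes/no answer.
ages = ["young"] * 8 + ["older"] * 2
answers = [1] * 8 + [0] * 2

# Assumed target population: half young, half older (an invented figure).
target = {"young": 0.5, "older": 0.5}

weights = poststratification_weights(ages, target)
raw_rate = sum(answers) / len(answers)
adjusted_rate = weighted_mean(answers, weights)
print(raw_rate, adjusted_rate)
```

Here the two older respondents each get a weight of about 2.5 and the younger ones about 0.625, so the weighted rate reflects the assumed population mix rather than the skewed sample. Real weighting usually balances several variables at once (for example by raking), which this sketch does not attempt.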

Aella's avatar

I've done this with the weighted version of my dataset on bigkinksurvey.com, but I believe releasing raw data with weights still allows the ability to view the raw data, and that's the rough part here

MLP's avatar

So cool, thanks for doing this!