GPT-4chan Model Card

GPT-4chan

Note on Torrents, etc.

Note that I have no association with any torrents, backups, or other means of obtaining this model. However, if you do use them, please be safe and verify your download. Here are the MD5 hashes (hex) for the pytorch_model.bin files:

pytorch_model.bin float32 : 833c1dc19b7450e4e559a9917b7d076a

pytorch_model.bin float16 : db3105866c9563b26f7399fafc00bb4b

And here are the SHA256 hashes:

pytorch_model.bin float32 : fc396bb082401c3c10daa1f0174d10782d95218181a8a6994f6112eb09d5a7e2

pytorch_model.bin float16 : 04195ed0feff23998edc1c486b7504a0c38c75c828018ed2743e0fa7ef4cb1df
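
To check a downloaded pytorch_model.bin against these hashes, a short standard-library sketch like the following works; the file path is simply wherever you saved the weights:

```python
import hashlib

def file_digests(path, chunk_size=1 << 20):
    """Compute MD5 and SHA256 of a file in one pass, without loading it whole."""
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()

# Compare against the published hashes, e.g. for the float32 weights:
EXPECTED_MD5 = "833c1dc19b7450e4e559a9917b7d076a"
EXPECTED_SHA256 = "fc396bb082401c3c10daa1f0174d10782d95218181a8a6994f6112eb09d5a7e2"

# md5_hex, sha256_hex = file_digests("pytorch_model.bin")
# assert md5_hex == EXPECTED_MD5 and sha256_hex == EXPECTED_SHA256
```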

Model Description

GPT-4chan is a language model fine-tuned from GPT-J 6B on 3.5 years worth of data from 4chan's politically incorrect (/pol/) board.

Training data

GPT-4chan was fine-tuned on the dataset Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board.

Training procedure

The model was trained for 1 epoch following GPT-J's fine-tuning guide.

Intended Use

GPT-4chan is trained on anonymously posted and sparsely moderated discussions of political topics. Its intended use is to reproduce text according to the distribution of its input data. It may also be a useful tool to investigate discourse in such anonymous online communities. Lastly, it has potential applications in tasks such as toxicity detection, as initial experiments show promising zero-shot results when comparing a string's likelihood under GPT-4chan to its likelihood under GPT-J 6B.
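
The likelihood comparison amounts to a log-likelihood ratio: sum the per-token log-probabilities of a string under each model and take the difference. The sketch below illustrates only the scoring arithmetic; the numbers are made-up placeholders, and in practice the per-token log-probabilities would come from each model's output logits:

```python
def log_likelihood_ratio(logp_gpt4chan, logp_gptj):
    """Log-likelihood ratio of a string under the two models.

    Each argument is a list of per-token log-probabilities of the same
    string under one model.  A positive ratio means the string is more
    likely under GPT-4chan than under GPT-J, which initial experiments
    suggest correlates with toxicity.
    """
    return sum(logp_gpt4chan) - sum(logp_gptj)

# Placeholder numbers for illustration only, not real model outputs:
ratio = log_likelihood_ratio([-2.1, -3.0, -1.2], [-2.5, -3.4, -2.0])
# ratio > 0 here, so this hypothetical string sits closer to the
# /pol/ distribution than to GPT-J's.
```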

How to use

The following is copied from the Hugging Face documentation on GPT-J. Refer to the original for more details.

For inference parameters, we recommend a temperature of 0.8, along with either a top_p of 0.8 or a typical_p of 0.3.
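
For reference, typical sampling (the typical_p argument) keeps the tokens whose information content is closest to the distribution's entropy, adding them in order of closeness until their total probability mass reaches typical_p. The following is a minimal, illustrative sketch of that filtering rule, not the transformers implementation itself:

```python
import math

def typical_filter(probs, typical_p=0.3):
    """Return the indices kept by typical sampling.

    Tokens are ranked by how close their information content (-log p)
    is to the distribution's entropy, and added until their cumulative
    probability mass reaches typical_p.  Simplified sketch only.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    ranked = sorted(range(len(probs)),
                    key=lambda i: abs(-math.log(probs[i]) - entropy))
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= typical_p:
            break
    return kept
```

With a very peaked distribution the single dominant token already carries enough mass, so only it survives; with a flat distribution several tokens are kept.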

For the float32 model (CPU):

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ykilcher/gpt-4chan")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.8,  # sampling settings recommended above
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]

For the float16 model (GPU):

from transformers import GPTJForCausalLM, AutoTokenizer
import torch

model = GPTJForCausalLM.from_pretrained(
    "ykilcher/gpt-4chan", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
model.cuda()
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.cuda()

gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.8,  # sampling settings recommended above
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]

Limitations and Biases

This is a statistical model. As such, it continues text as is likely under the distribution the model has learned from the training data. Outputs should not be interpreted as "correct", "truthful", or otherwise as anything more than a statistical function of the input. That being said, GPT-4chan does significantly outperform GPT-J (and GPT-3) on the TruthfulQA benchmark, which measures whether a language model is truthful in generating answers to questions.

The dataset is time- and domain-limited. It was collected from 2016 to 2019 on 4chan's politically incorrect board. As such, political topics from that era will be overrepresented in the model's distribution, compared to other models (e.g. GPT-J 6B). Also, due to the very lax rules and anonymity of posters, a large part of the dataset contains offensive material. Thus, it is very likely that the model will produce offensive outputs, including but not limited to: toxicity, hate speech, racism, sexism, homo- and transphobia, xenophobia, and anti-semitism.

Due to the above limitations, it is strongly recommended not to deploy this model in a real-world environment unless its behavior is well understood and explicit, strict limitations on the scope, impact, and duration of the deployment are enforced.

Evaluation results

Language Model Evaluation Harness

The following table compares GPT-J 6B to GPT-4chan on a subset of the Language Model Evaluation Harness. Differences exceeding standard errors are marked in the "Significant" column with a minus sign (-) indicating an advantage for GPT-J 6B and a plus sign (+) indicating an advantage for GPT-4chan.
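
One plausible reading of the marking rule is that a difference counts as significant when its absolute value exceeds both models' standard errors. A small sketch of that criterion (the table's exact rule is an assumption here, and it applies to higher-is-better metrics such as accuracy; for perplexity-style metrics the sign convention is inverted):

```python
def significance_mark(score_4chan, stderr_4chan, score_gptj, stderr_gptj):
    """Return '+', '-', or '' following the table's convention.

    Assumes "exceeding standard errors" means the absolute difference
    is larger than both standard errors (i.e. than their maximum);
    written for higher-is-better metrics like accuracy.
    """
    diff = score_4chan - score_gptj
    if abs(diff) <= max(stderr_4chan, stderr_gptj):
        return ""
    return "+" if diff > 0 else "-"
```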

| Task | Metric | GPT-4chan | stderr | GPT-J-6B | stderr | Significant |
|---|---|---|---|---|---|---|
| copa | acc | 0.85 | 0.035887 | 0.83 | 0.0377525 |  |
| blimp_only_npi_scope | acc | 0.712 | 0.0143269 | 0.787 | 0.0129537 | - |
| hendrycksTest-conceptual_physics | acc | 0.251064 | 0.028347 | 0.255319 | 0.0285049 |  |
| hendrycksTest-conceptual_physics | acc_norm | 0.187234 | 0.0255016 | 0.191489 | 0.0257221 |  |
| hendrycksTest-high_school_mathematics | acc | 0.248148 | 0.0263357 | 0.218519 | 0.0251958 | + |
| hendrycksTest-high_school_mathematics | acc_norm | 0.3 | 0.0279405 | 0.251852 | 0.0264661 | + |
| blimp_sentential_negation_npi_scope | acc | 0.734 | 0.01398 | 0.733 | 0.0139967 |  |
| hendrycksTest-high_school_european_history | acc | 0.278788 | 0.0350144 | 0.260606 | 0.0342774 |  |
| hendrycksTest-high_school_european_history | acc_norm | 0.315152 | 0.0362773 | 0.278788 | 0.0350144 | + |
| blimp_wh_questions_object_gap | acc | 0.841 | 0.0115695 | 0.835 | 0.0117436 |  |
| hendrycksTest-international_law | acc | 0.214876 | 0.0374949 | 0.264463 | 0.0402619 | - |
| hendrycksTest-international_law | acc_norm | 0.438017 | 0.0452915 | 0.404959 | 0.0448114 |  |
| hendrycksTest-high_school_us_history | acc | 0.323529 | 0.0328347 | 0.289216 | 0.0318223 | + |
| hendrycksTest-high_school_us_history | acc_norm | 0.323529 | 0.0328347 | 0.29902 | 0.0321333 |  |
| openbookqa | acc | 0.276 | 0.0200112 | 0.29 | 0.0203132 |  |
| openbookqa | acc_norm | 0.362 | 0.0215137 | 0.382 | 0.0217508 |  |
| blimp_causative | acc | 0.737 | 0.0139293 | 0.761 | 0.013493 | - |
| record | f1 | 0.878443 | 0.00322394 | 0.885049 | 0.00314367 | - |
| record | em | 0.8702 | 0.003361 | 0.8765 | 0.00329027 | - |
| blimp_determiner_noun_agreement_1 | acc | 0.996 | 0.00199699 | 0.995 | 0.00223159 |  |
| hendrycksTest-miscellaneous | acc | 0.305236 | 0.0164677 | 0.274585 | 0.0159598 | + |
| hendrycksTest-miscellaneous | acc_norm | 0.269476 | 0.0158662 | 0.260536 | 0.015696 |  |
| hendrycksTest-virology | acc | 0.343373 | 0.0369658 | 0.349398 | 0.0371173 |  |
| hendrycksTest-virology | acc_norm | 0.331325 | 0.0366431 | 0.325301 | 0.0364717 |  |
| mathqa | acc | 0.269012 | 0.00811786 | 0.267002 | 0.00809858 |  |
| mathqa | acc_norm | 0.261642 | 0.00804614 | 0.270687 | 0.00813376 | - |
| squad2 | exact | 10.6123 | 0 | 10.6207 | 0 | - |
| squad2 | f1 | 17.8734 | 0 | 17.7413 | 0 | + |
| squad2 | HasAns_exact | 17.2571 | 0 | 15.5027 | 0 | + |
| squad2 | HasAns_f1 | 31.8 | 0 | 29.7643 | 0 | + |
| squad2 | NoAns_exact | 3.98654 | 0 | 5.75273 | 0 | - |
| squad2 | NoAns_f1 | 3.98654 | 0 | 5.75273 | 0 | - |
| squad2 | best_exact | 50.0716 | 0 | 50.0716 | 0 |  |
| squad2 | best_f1 | 50.077 | 0 | 50.0778 | 0 | - |
| mnli_mismatched | acc | 0.320586 | 0.00470696 | 0.376627 | 0.00488687 | - |
| blimp_animate_subject_passive | acc | 0.79 | 0.0128867 | 0.781 | 0.0130847 |  |
| blimp_determiner_noun_agreement_with_adj_irregular_1 | acc | 0.834 | 0.0117721 | 0.878 | 0.0103549 | - |
| qnli | acc | 0.491305 | 0.00676439 | 0.513454 | 0.00676296 | - |
| blimp_intransitive | acc | 0.806 | 0.0125108 | 0.858 | 0.0110435 | - |
| ethics_cm | acc | 0.512227 | 0.00802048 | 0.559846 | 0.00796521 | - |
| hendrycksTest-high_school_computer_science | acc | 0.2 | 0.0402015 | 0.25 | 0.0435194 | - |
| hendrycksTest-high_school_computer_science | acc_norm | 0.26 | 0.0440844 | 0.27 | 0.0446196 |  |
| iwslt17-ar-en | bleu | 21.4685 | 0.648252 | 20.7322 | 0.795602 | + |
| iwslt17-ar-en | chrf | 0.452175 | 0.00498012 | 0.450919 | 0.00526515 |  |
| iwslt17-ar-en | ter | 0.733514 | 0.0201688 | 0.787631 | 0.0285488 | + |
| hendrycksTest-security_studies | acc | 0.391837 | 0.0312513 | 0.363265 | 0.0307891 |  |
| hendrycksTest-security_studies | acc_norm | 0.285714 | 0.0289206 | 0.285714 | 0.0289206 |  |
| hendrycksTest-global_facts | acc | 0.29 | 0.0456048 | 0.25 | 0.0435194 |  |
| hendrycksTest-global_facts | acc_norm | 0.26 | 0.0440844 | 0.22 | 0.0416333 |  |
| anli_r1 | acc | 0.297 | 0.0144568 | 0.322 | 0.0147829 | - |
| blimp_left_branch_island_simple_question | acc | 0.884 | 0.0101315 | 0.867 | 0.0107437 | + |
| hendrycksTest-astronomy | acc | 0.25 | 0.0352381 | 0.25 | 0.0352381 |  |
| hendrycksTest-astronomy | acc_norm | 0.348684 | 0.0387814 | 0.335526 | 0.038425 |  |
| mrpc | acc | 0.536765 | 0.024717 | 0.683824 | 0.0230483 | - |
| mrpc | f1 | 0.63301 | 0.0247985 | 0.812227 | 0.0162476 | - |
| ethics_utilitarianism | acc | 0.525374 | 0.00720233 | 0.509775 | 0.00721024 | + |
| blimp_determiner_noun_agreement_2 | acc | 0.99 | 0.003148 | 0.977 | 0.00474273 | + |
| lambada_cloze | ppl | 388.123 | 13.1523 | 405.646 | 14.5519 | + |
| lambada_cloze | acc | 0.0116437 | 0.00149456 | 0.0199884 | 0.00194992 | - |
| truthfulqa_mc | mc1 | 0.225214 | 0.0146232 | 0.201958 | 0.014054 | + |
| truthfulqa_mc | mc2 | 0.371625 | 0.0136558 | 0.359537 | 0.0134598 |  |
| blimp_wh_vs_that_with_gap_long_distance | acc | 0.441 | 0.0157088 | 0.342 | 0.0150087 | + |
| hendrycksTest-business_ethics | acc | 0.28 | 0.0451261 | 0.29 | 0.0456048 |  |
| hendrycksTest-business_ethics | acc_norm | 0.29 | 0.0456048 | 0.3 | 0.0460566 |  |
| arithmetic_3ds | acc | 0.0065 | 0.00179736 | 0.046 | 0.0046854 | - |
| blimp_determiner_noun_agreement_with_adjective_1 | acc | 0.988 | 0.00344498 | 0.978 | 0.00464086 | + |
| hendrycksTest-moral_disputes | acc | 0.277457 | 0.0241057 | 0.283237 | 0.0242579 |  |
| hendrycksTest-moral_disputes | acc_norm | 0.309249 | 0.0248831 | 0.32659 | 0.0252483 |  |
| arithmetic_2da | acc | 0.0455 | 0.00466109 | 0.2405 | 0.00955906 | - |
| qa4mre_2011 | acc | 0.425 | 0.0453163 | 0.458333 | 0.0456755 |  |
| qa4mre_2011 | acc_norm | 0.558333 | 0.0455219 | 0.533333 | 0.045733 |  |
| blimp_regular_plural_subject_verb_agreement_1 | acc | 0.966 | 0.00573384 | 0.968 | 0.00556839 |  |
| hendrycksTest-human_sexuality | acc | 0.389313 | 0.0427649 | 0.396947 | 0.0429114 |  |
| hendrycksTest-human_sexuality | acc_norm | 0.305344 | 0.0403931 | 0.343511 | 0.0416498 |  |
| blimp_passive_1 | acc | 0.878 | 0.0103549 | 0.885 | 0.0100934 |  |
| blimp_drop_argument | acc | 0.784 | 0.0130197 | 0.823 | 0.0120755 | - |
| hendrycksTest-high_school_microeconomics | acc | 0.260504 | 0.0285103 | 0.277311 | 0.0290794 |  |
| hendrycksTest-high_school_microeconomics | acc_norm | 0.390756 | 0.0316938 | 0.39916 | 0.0318111 |  |
| hendrycksTest-us_foreign_policy | acc | 0.32 | 0.0468826 | 0.34 | 0.0476095 |  |
| hendrycksTest-us_foreign_policy | acc_norm | 0.4 | 0.0492366 | 0.35 | 0.0479372 | + |
| blimp_ellipsis_n_bar_1 | acc | 0.846 | 0.0114199 | 0.841 | 0.0115695 |  |
| hendrycksTest-high_school_physics | acc | 0.264901 | 0.0360304 | 0.271523 | 0.0363133 |  |
| hendrycksTest-high_school_physics | acc_norm | 0.284768 | 0.0368488 | 0.271523 | 0.0363133 |  |
| qa4mre_2013 | acc | 0.362676 | 0.028579 | 0.401408 | 0.0291384 | - |
| qa4mre_2013 | acc_norm | 0.387324 | 0.0289574 | 0.383803 | 0.0289082 |  |
| blimp_wh_vs_that_no_gap | acc | 0.963 | 0.00597216 | 0.969 | 0.00548353 | - |
| headqa_es | acc | 0.238877 | 0.00814442 | 0.251276 | 0.0082848 | - |
| headqa_es | acc_norm | 0.290664 | 0.00867295 | 0.286652 | 0.00863721 |  |
| blimp_sentential_subject_island | acc | 0.359 | 0.0151773 | 0.421 | 0.0156206 | - |
| hendrycksTest-philosophy | acc | 0.241158 | 0.0242966 | 0.26045 | 0.0249267 |  |
| hendrycksTest-philosophy | acc_norm | 0.327974 | 0.0266644 | 0.334405 | 0.0267954 |  |
| hendrycksTest-elementary_mathematics | acc | 0.248677 | 0.0222618 | 0.251323 | 0.0223405 |  |
| hendrycksTest-elementary_mathematics | acc_norm | 0.275132 | 0.0230001 | 0.26455 | 0.0227175 |  |
| math_geometry | acc | 0.0187891 | 0.00621042 | 0.0104384 | 0.00464863 | + |
| blimp_wh_questions_subject_gap_long_distance | acc | 0.886 | 0.0100551 | 0.883 | 0.0101693 |  |
| hendrycksTest-college_physics | acc | 0.205882 | 0.0402338 | 0.205882 | 0.0402338 |  |
| hendrycksTest-college_physics | acc_norm | 0.22549 | 0.0415831 | 0.245098 | 0.0428011 |  |
| hellaswag | acc | 0.488747 | 0.00498852 | 0.49532 | 0.00498956 | - |
| hellaswag | acc_norm | 0.648277 | 0.00476532 | 0.66202 | 0.00472055 | - |
| hendrycksTest-logical_fallacies | acc | 0.269939 | 0.0348783 | 0.294479 | 0.0358117 |  |
| hendrycksTest-logical_fallacies | acc_norm | 0.343558 | 0.0373113 | 0.355828 | 0.0376152 |  |
| hendrycksTest-machine_learning | acc | 0.339286 | 0.0449395 | 0.223214 | 0.039523 | + |
| hendrycksTest-machine_learning | acc_norm | 0.205357 | 0.0383424 | 0.178571 | 0.0363521 |  |
| hendrycksTest-high_school_psychology | acc | 0.286239 | 0.0193794 | 0.273394 | 0.0191093 |  |
| hendrycksTest-high_school_psychology | acc_norm | 0.266055 | 0.018946 | 0.269725 | 0.0190285 |  |
| prost | acc | 0.256298 | 0.00318967 | 0.268254 | 0.00323688 | - |
| prost | acc_norm | 0.280156 | 0.00328089 | 0.274658 | 0.00326093 | + |
| blimp_determiner_noun_agreement_with_adj_irregular_2 | acc | 0.898 | 0.00957537 | 0.916 | 0.00877616 | - |
| wnli | acc | 0.43662 | 0.0592794 | 0.464789 | 0.0596131 |  |
| hendrycksTest-professional_law | acc | 0.284876 | 0.0115278 | 0.273794 | 0.0113886 |  |
| hendrycksTest-professional_law | acc_norm | 0.301825 | 0.0117244 | 0.292699 | 0.0116209 |  |
| math_algebra | acc | 0.0126369 | 0.00324352 | 0.0117944 | 0.00313487 |  |
| wikitext | word_perplexity | 11.4687 | 0 | 10.8819 | 0 | - |
| wikitext | byte_perplexity | 1.5781 | 0 | 1.56268 | 0 | - |
| wikitext | bits_per_byte | 0.658188 | 0 | 0.644019 | 0 | - |
| anagrams1 | acc | 0.0125 | 0.00111108 | 0.0008 | 0.000282744 | + |
| math_prealgebra | acc | 0.0195178 | 0.00469003 | 0.0126292 | 0.00378589 | + |
| blimp_principle_A_domain_2 | acc | 0.887 | 0.0100166 | 0.889 | 0.0099387 |  |
| cycle_letters | acc | 0.0331 | 0.00178907 | 0.0026 | 0.000509264 | + |
| hendrycksTest-college_mathematics | acc | 0.26 | 0.0440844 | 0.26 | 0.0440844 |  |
| hendrycksTest-college_mathematics | acc_norm | 0.31 | 0.0464823 | 0.4 | 0.0492366 | - |
| arithmetic_1dc | acc | 0.077 | 0.00596266 | 0.089 | 0.00636866 | - |
| arithmetic_4da | acc | 0.0005 | 0.0005 | 0.007 | 0.00186474 | - |
| triviaqa | acc | 0.150888 | 0.00336543 | 0.167418 | 0.00351031 | - |
| boolq | acc | 0.673394 | 0.00820236 | 0.655352 | 0.00831224 | + |
| random_insertion | acc | 0.0004 | 0.00019997 | 0 | 0 | + |
| qa4mre_2012 | acc | 0.4 | 0.0388514 | 0.4125 | 0.0390407 |  |
| qa4mre_2012 | acc_norm | 0.4625 | 0.0395409 | 0.50625 | 0.0396495 | - |
| math_asdiv | acc | 0.00997831 | 0.00207066 | 0.00563991 | 0.00156015 | + |
| hendrycksTest-moral_scenarios | acc | 0.236872 | 0.0142196 | 0.236872 | 0.0142196 |  |
| hendrycksTest-moral_scenarios | acc_norm | 0.272626 | 0.0148934 | 0.272626 | 0.0148934 |  |
| hendrycksTest-high_school_geography | acc | 0.247475 | 0.0307463 | 0.20202 | 0.0286062 | + |
| hendrycksTest-high_school_geography | acc_norm | 0.287879 | 0.0322588 | 0.292929 | 0.032425 |  |
| gsm8k | acc | 0 | 0 | 0 | 0 |  |
| blimp_existential_there_object_raising | acc | 0.812 | 0.0123616 | 0.792 | 0.0128414 | + |
| blimp_superlative_quantifiers_2 | acc | 0.917 | 0.00872853 | 0.865 | 0.0108117 | + |
| hendrycksTest-college_chemistry | acc | 0.28 | 0.0451261 | 0.24 | 0.0429235 |  |
| hendrycksTest-college_chemistry | acc_norm | 0.31 | 0.0464823 | 0.28 | 0.0451261 |  |
| blimp_existential_there_quantifiers_2 | acc | 0.545 | 0.0157551 | 0.383 | 0.0153801 | + |
| hendrycksTest-abstract_algebra | acc | 0.17 | 0.0377525 | 0.26 | 0.0440844 | - |
| hendrycksTest-abstract_algebra | acc_norm | 0.26 | 0.0440844 | 0.3 | 0.0460566 |  |
| hendrycksTest-professional_psychology | acc | 0.26634 | 0.0178832 | 0.28268 | 0.0182173 |  |
| hendrycksTest-professional_psychology | acc_norm | 0.256536 | 0.0176678 | 0.259804 | 0.0177409 |  |
| ethics_virtue | acc | 0.249849 | 0.00613847 | 0.200201 | 0.00567376 | + |
| ethics_virtue | em | 0.0040201 | 0 | 0 | 0 | + |
| arithmetic_5da | acc | 0 | 0 | 0.0005 | 0.0005 | - |
| mutual | r@1 | 0.455982 | 0.0167421 | 0.468397 | 0.0167737 |  |
| mutual | r@2 | 0.732506 | 0.0148796 | 0.735892 | 0.0148193 |  |
| mutual | mrr | 0.675226 | 0.0103132 | 0.682186 | 0.0103375 |  |
| blimp_irregular_past_participle_verbs | acc | 0.869 | 0.0106749 | 0.876 | 0.0104275 |  |
| ethics_deontology | acc | 0.497775 | 0.00833904 | 0.523637 | 0.0083298 | - |
| ethics_deontology | em | 0.00333704 | 0 | 0.0355951 | 0 | - |
| blimp_transitive | acc | 0.818 | 0.0122076 | 0.855 | 0.01114 | - |
| hendrycksTest-college_computer_science | acc | 0.29 | 0.0456048 | 0.27 | 0.0446196 |  |
| hendrycksTest-college_computer_science | acc_norm | 0.27 | 0.0446196 | 0.26 | 0.0440844 |  |
| hendrycksTest-professional_medicine | acc | 0.283088 | 0.0273659 | 0.272059 | 0.027033 |  |
| hendrycksTest-professional_medicine | acc_norm | 0.279412 | 0.0272572 | 0.261029 | 0.0266793 |  |
| sciq | acc | 0.895 | 0.00969892 | 0.915 | 0.00882343 | - |
| sciq | acc_norm | 0.869 | 0.0106749 | 0.874 | 0.0104992 |  |
| blimp_anaphor_number_agreement | acc | 0.993 | 0.00263779 | 0.995 | 0.00223159 |  |
| blimp_wh_questions_subject_gap | acc | 0.925 | 0.00833333 | 0.913 | 0.00891687 | + |
| blimp_wh_vs_that_with_gap | acc | 0.482 | 0.015809 | 0.429 | 0.015659 | + |
| math_num_theory | acc | 0.0351852 | 0.00793611 | 0.0203704 | 0.00608466 | + |
| blimp_complex_NP_island | acc | 0.538 | 0.0157735 | 0.535 | 0.0157805 |  |
| blimp_expletive_it_object_raising | acc | 0.777 | 0.0131698 | 0.78 | 0.0131062 |  |
| lambada_mt_en | ppl | 4.62504 | 0.10549 | 4.10224 | 0.0884971 | - |
| lambada_mt_en | acc | 0.648554 | 0.00665142 | 0.682127 | 0.00648741 | - |
| hendrycksTest-formal_logic | acc | 0.309524 | 0.0413491 | 0.34127 | 0.042408 |  |
| hendrycksTest-formal_logic | acc_norm | 0.325397 | 0.041906 | 0.325397 | 0.041906 |  |
| blimp_matrix_question_npi_licensor_present | acc | 0.663 | 0.0149551 | 0.727 | 0.014095 | - |
| blimp_superlative_quantifiers_1 | acc | 0.791 | 0.0128641 | 0.871 | 0.0106053 | - |
| lambada_mt_de | ppl | 89.7905 | 5.30301 | 82.2416 | 4.88447 | - |
| lambada_mt_de | acc | 0.312245 | 0.0064562 | 0.312827 | 0.00645948 |  |
| hendrycksTest-computer_security | acc | 0.37 | 0.0485237 | 0.27 | 0.0446196 | + |
| hendrycksTest-computer_security | acc_norm | 0.37 | 0.0485237 | 0.33 | 0.0472582 |  |
| ethics_justice | acc | 0.501479 | 0.00961712 | 0.526627 | 0.00960352 | - |
| ethics_justice | em | 0 | 0 | 0.0251479 | 0 | - |
| blimp_principle_A_reconstruction | acc | 0.296 | 0.0144427 | 0.444 | 0.0157198 | - |
| blimp_existential_there_subject_raising | acc | 0.877 | 0.0103913 | 0.875 | 0.0104635 |  |
| math_precalc | acc | 0.014652 | 0.00514689 | 0.0018315 | 0.0018315 | + |
| qasper | f1_yesno | 0.632997 | 0.032868 | 0.666667 | 0.0311266 | - |
| qasper | f1_abstractive | 0.113489 | 0.00729073 | 0.118383 | 0.00692993 |  |
| cb | acc | 0.196429 | 0.0535714 | 0.357143 | 0.0646096 | - |
| cb | f1 | 0.149038 | 0 | 0.288109 | 0 | - |
| blimp_animate_subject_trans | acc | 0.858 | 0.0110435 | 0.868 | 0.0107094 |  |
| hendrycksTest-high_school_statistics | acc | 0.310185 | 0.031547 | 0.291667 | 0.0309987 |  |
| hendrycksTest-high_school_statistics | acc_norm | 0.361111 | 0.0327577 | 0.314815 | 0.0316747 | + |
| blimp_irregular_plural_subject_verb_agreement_2 | acc | 0.881 | 0.0102442 | 0.919 | 0.00863212 | - |
| lambada_mt_es | ppl | 92.1172 | 5.05064 | 83.6696 | 4.57489 | - |
| lambada_mt_es | acc | 0.322337 | 0.00651139 | 0.326994 | 0.00653569 |  |
| anli_r2 | acc | 0.327 | 0.0148422 | 0.337 | 0.0149551 |  |
| hendrycksTest-nutrition | acc | 0.346405 | 0.0272456 | 0.346405 | 0.0272456 |  |
| hendrycksTest-nutrition | acc_norm | 0.385621 | 0.0278707 | 0.401961 | 0.0280742 |  |
| anli_r3 | acc | 0.336667 | 0.0136476 | 0.3525 | 0.0137972 | - |
| blimp_regular_plural_subject_verb_agreement_2 | acc | 0.897 | 0.00961683 | 0.916 | 0.00877616 | - |
| blimp_tough_vs_raising_2 | acc | 0.826 | 0.0119945 | 0.857 | 0.0110758 | - |
| mnli | acc | 0.316047 | 0.00469317 | 0.374733 | 0.00488619 | - |
| drop | em | 0.0595638 | 0.00242379 | 0.0228607 | 0.0015306 | + |
| drop | f1 | 0.120355 | 0.00270951 | 0.103871 | 0.00219977 | + |
| blimp_determiner_noun_agreement_with_adj_2 | acc | 0.95 | 0.00689547 | 0.936 | 0.00774364 | + |
| arithmetic_2dm | acc | 0.061 | 0.00535293 | 0.14 | 0.00776081 | - |
| blimp_determiner_noun_agreement_irregular_2 | acc | 0.93 | 0.00807249 | 0.932 | 0.00796489 |  |
| lambada | ppl | 4.62504 | 0.10549 | 4.10224 | 0.0884971 | - |
| lambada | acc | 0.648554 | 0.00665142 | 0.682127 | 0.00648741 | - |
| arithmetic_3da | acc | 0.007 | 0.00186474 | 0.0865 | 0.00628718 | - |
| blimp_irregular_past_participle_adjectives | acc | 0.947 | 0.00708811 | 0.956 | 0.00648892 | - |
| hendrycksTest-college_biology | acc | 0.201389 | 0.0335365 | 0.284722 | 0.0377381 | - |
| hendrycksTest-college_biology | acc_norm | 0.222222 | 0.0347659 | 0.270833 | 0.0371618 | - |
| headqa_en | acc | 0.324945 | 0.00894582 | 0.335522 | 0.00901875 | - |
| headqa_en | acc_norm | 0.375638 | 0.00925014 | 0.383297 | 0.00928648 |  |
| blimp_determiner_noun_agreement_irregular_1 | acc | 0.912 | 0.00896305 | 0.944 | 0.0072744 | - |
| blimp_existential_there_quantifiers_1 | acc | 0.985 | 0.00384575 | 0.981 | 0.00431945 |  |
| blimp_inchoative | acc | 0.653 | 0.0150605 | 0.683 | 0.0147217 | - |
| mutual_plus | r@1 | 0.395034 | 0.0164328 | 0.409707 | 0.016531 |  |
| mutual_plus | r@2 | 0.674944 | 0.015745 | 0.680587 | 0.0156728 |  |
| mutual_plus | mrr | 0.632713 | 0.0103391 | 0.640801 | 0.0104141 |  |
| blimp_tough_vs_raising_1 | acc | 0.736 | 0.0139463 | 0.734 | 0.01398 |  |
| winogrande | acc | 0.636148 | 0.0135215 | 0.640884 | 0.0134831 |  |
| race | acc | 0.374163 | 0.0149765 | 0.37512 | 0.0149842 |  |
| blimp_irregular_plural_subject_verb_agreement_1 | acc | 0.908 | 0.00914438 | 0.918 | 0.00868052 | - |
| hendrycksTest-high_school_macroeconomics | acc | 0.284615 | 0.0228783 | 0.284615 | 0.0228783 |  |
| hendrycksTest-high_school_macroeconomics | acc_norm | 0.284615 | 0.0228783 | 0.276923 | 0.022688 |  |
| blimp_adjunct_island | acc | 0.888 | 0.00997775 | 0.902 | 0.00940662 | - |
| hendrycksTest-high_school_chemistry | acc | 0.236453 | 0.0298961 | 0.211823 | 0.028749 |  |
| hendrycksTest-high_school_chemistry | acc_norm | 0.300493 | 0.032258 | 0.29064 | 0.0319474 |  |
| arithmetic_2ds | acc | 0.051 | 0.00492053 | 0.218 | 0.00923475 | - |
| blimp_principle_A_case_2 | acc | 0.955 | 0.00655881 | 0.953 | 0.00669596 |  |
| blimp_only_npi_licensor_present | acc | 0.926 | 0.00828206 | 0.953 | 0.00669596 | - |
| math_counting_and_prob | acc | 0.0274262 | 0.00750954 | 0.0021097 | 0.0021097 | + |
| cola | mcc | -0.0854256 | 0.0304519 | -0.0504508 | 0.0251594 | - |
| webqs | acc | 0.023622 | 0.00336987 | 0.0226378 | 0.00330058 |  |
| arithmetic_4ds | acc | 0.0005 | 0.0005 | 0.0055 | 0.00165416 | - |
| blimp_wh_vs_that_no_gap_long_distance | acc | 0.94 | 0.00751375 | 0.939 | 0.00757208 |  |
| pile_bookcorpus2 | word_perplexity | 28.7786 | 0 | 27.0559 | 0 | - |
| pile_bookcorpus2 | byte_perplexity | 1.79969 | 0 | 1.78037 | 0 | - |
| pile_bookcorpus2 | bits_per_byte | 0.847751 | 0 | 0.832176 | 0 | - |
| blimp_sentential_negation_npi_licensor_present | acc | 0.994 | 0.00244335 | 0.982 | 0.00420639 | + |
| hendrycksTest-high_school_government_and_politics | acc | 0.274611 | 0.0322102 | 0.227979 | 0.0302769 | + |
| hendrycksTest-high_school_government_and_politics | acc_norm | 0.259067 | 0.0316188 | 0.248705 | 0.0311958 |  |
| blimp_ellipsis_n_bar_2 | acc | 0.937 | 0.00768701 | 0.916 | 0.00877616 | + |
| hendrycksTest-clinical_knowledge | acc | 0.283019 | 0.0277242 | 0.267925 | 0.0272573 |  |
| hendrycksTest-clinical_knowledge | acc_norm | 0.343396 | 0.0292245 | 0.316981 | 0.0286372 |  |
| mc_taco | em | 0.125375 | 0 | 0.132883 | 0 | - |
| mc_taco | f1 | 0.487131 | 0 | 0.499712 | 0 | - |
| wsc | acc | 0.365385 | 0.0474473 | 0.365385 | 0.0474473 |  |
| hendrycksTest-college_medicine | acc | 0.231214 | 0.0321474 | 0.190751 | 0.0299579 | + |
| hendrycksTest-college_medicine | acc_norm | 0.289017 | 0.0345643 | 0.265896 | 0.0336876 |  |
| hendrycksTest-high_school_world_history | acc | 0.295359 | 0.0296963 | 0.2827 | 0.0293128 |  |
| hendrycksTest-high_school_world_history | acc_norm | 0.312236 | 0.0301651 | 0.312236 | 0.0301651 |  |
| hendrycksTest-anatomy | acc | 0.296296 | 0.0394462 | 0.281481 | 0.03885 |  |
| hendrycksTest-anatomy | acc_norm | 0.288889 | 0.0391545 | 0.266667 | 0.0382017 |  |
| hendrycksTest-jurisprudence | acc | 0.25 | 0.0418609 | 0.277778 | 0.0433004 |  |
| hendrycksTest-jurisprudence | acc_norm | 0.416667 | 0.0476608 | 0.425926 | 0.0478034 |  |
| logiqa | acc | 0.193548 | 0.0154963 | 0.211982 | 0.016031 | - |
| logiqa | acc_norm | 0.281106 | 0.0176324 | 0.291859 | 0.0178316 |  |
| ethics_utilitarianism_original | acc | 0.767679 | 0.00609112 | 0.941556 | 0.00338343 | - |
| blimp_principle_A_c_command | acc | 0.827 | 0.0119672 | 0.81 | 0.0124119 | + |
| blimp_coordinate_structure_constraint_complex_left_branch | acc | 0.794 | 0.0127956 | 0.764 | 0.0134345 | + |
| arithmetic_5ds | acc | 0 | 0 | 0 | 0 |  |
| lambada_mt_it | ppl | 96.8846 | 5.80902 | 86.66 | 5.1869 | - |
| lambada_mt_it | acc | 0.328158 | 0.00654165 | 0.336891 | 0.0065849 | - |
| wsc273 | acc | 0.827839 | 0.0228905 | 0.827839 | 0.0228905 |  |
| blimp_coordinate_structure_constraint_object_extraction | acc | 0.852 | 0.0112349 | 0.876 | 0.0104275 | - |
| blimp_principle_A_domain_3 | acc | 0.79 | 0.0128867 | 0.819 | 0.0121814 | - |
| blimp_left_branch_island_echo_question | acc | 0.638 | 0.0152048 | 0.519 | 0.0158079 | + |
| rte | acc | 0.534296 | 0.0300256 | 0.548736 | 0.0299531 |  |
| blimp_passive_2 | acc | 0.892 | 0.00982 | 0.899 | 0.00953362 |  |
| hendrycksTest-electrical_engineering | acc | 0.344828 | 0.0396093 | 0.358621 | 0.0399663 |  |
| hendrycksTest-electrical_engineering | acc_norm | 0.372414 | 0.0402873 | 0.372414 | 0.0402873 |  |
| sst | acc | 0.626147 | 0.0163938 | 0.493119 | 0.0169402 | + |
| blimp_npi_present_1 | acc | 0.565 | 0.0156851 | 0.576 | 0.0156355 |  |
| piqa | acc | 0.739391 | 0.0102418 | 0.754081 | 0.0100473 | - |
| piqa | acc_norm | 0.755169 | 0.0100323 | 0.761697 | 0.00994033 |  |
| hendrycksTest-professional_accounting | acc | 0.312057 | 0.0276401 | 0.265957 | 0.0263581 | + |
| hendrycksTest-professional_accounting | acc_norm | 0.27305 | 0.0265779 | 0.22695 | 0.0249871 | + |
| arc_challenge | acc | 0.325085 | 0.0136881 | 0.337884 | 0.013822 |  |
| arc_challenge | acc_norm | 0.352389 | 0.0139601 | 0.366041 | 0.0140772 |  |
| hendrycksTest-econometrics | acc | 0.263158 | 0.0414244 | 0.245614 | 0.0404934 |  |
| hendrycksTest-econometrics | acc_norm | 0.254386 | 0.0409699 | 0.27193 | 0.0418577 |  |
| headqa | acc | 0.238877 | 0.00814442 | 0.251276 | 0.0082848 | - |
| headqa | acc_norm | 0.290664 | 0.00867295 | 0.286652 | 0.00863721 |  |
| wic | acc | 0.482759 | 0.0197989 | 0.5 | 0.0198107 |  |
| hendrycksTest-high_school_biology | acc | 0.270968 | 0.0252844 | 0.251613 | 0.024686 |  |
| hendrycksTest-high_school_biology | acc_norm | 0.274194 | 0.0253781 | 0.283871 | 0.0256494 |  |
| hendrycksTest-management | acc | 0.281553 | 0.0445325 | 0.23301 | 0.0418583 | + |
| hendrycksTest-management | acc_norm | 0.291262 | 0.0449868 | 0.320388 | 0.0462028 |  |
| blimp_npi_present_2 | acc | 0.645 | 0.0151395 | 0.664 | 0.0149441 | - |
| hendrycksTest-prehistory | acc | 0.265432 | 0.0245692 | 0.243827 | 0.0238919 |  |
| hendrycksTest-prehistory | acc_norm | 0.225309 | 0.0232462 | 0.219136 | 0.0230167 |  |
| hendrycksTest-world_religions | acc | 0.321637 | 0.0358253 | 0.333333 | 0.0361551 |  |
| hendrycksTest-world_religions | acc_norm | 0.397661 | 0.0375364 | 0.380117 | 0.0372297 |  |
| math_intermediate_algebra | acc | 0.00996678 | 0.00330749 | 0.00332226 | 0.00191598 | + |
| anagrams2 | acc | 0.0347 | 0.00183028 | 0.0055 | 0.000739615 | + |
| arc_easy | acc | 0.647306 | 0.00980442 | 0.669613 | 0.00965143 | - |
| arc_easy | acc_norm | 0.609848 | 0.0100091 | 0.622896 | 0.00994504 | - |
| blimp_anaphor_gender_agreement | acc | 0.993 | 0.00263779 | 0.994 | 0.00244335 |  |
| hendrycksTest-marketing | acc | 0.311966 | 0.0303515 | 0.307692 | 0.0302364 |  |
| hendrycksTest-marketing | acc_norm | 0.34188 | 0.031075 | 0.294872 | 0.0298726 | + |
| blimp_principle_A_domain_1 | acc | 0.997 | 0.00173032 | 0.997 | 0.00173032 |  |
| blimp_wh_island | acc | 0.856 | 0.011108 | 0.852 | 0.0112349 |  |
| hendrycksTest-sociology | acc | 0.303483 | 0.0325101 | 0.278607 | 0.0317006 |  |
| hendrycksTest-sociology | acc_norm | 0.298507 | 0.0323574 | 0.318408 | 0.0329412 |  |
| blimp_distractor_agreement_relative_clause | acc | 0.774 | 0.0132325 | 0.719 | 0.0142212 | + |
| truthfulqa_gen | bleurt_max | -0.811655 | 0.0180743 | -0.814228 | 0.0172128 |  |
| truthfulqa_gen | bleurt_acc | 0.395349 | 0.0171158 | 0.329253 | 0.0164513 | + |
| truthfulqa_gen | bleurt_diff | -0.0488385 | 0.0204525 | -0.185905 | 0.0169617 | + |
| truthfulqa_gen | bleu_max | 20.8747 | 0.717003 | 20.2238 | 0.711772 |  |
| truthfulqa_gen | bleu_acc | 0.330477 | 0.0164668 | 0.281518 | 0.015744 | + |
| truthfulqa_gen | bleu_diff | -2.12856 | 0.832693 | -6.66121 | 0.719366 | + |
| truthfulqa_gen | rouge1_max | 47.0293 | 0.962404 | 45.3457 | 0.89238 | + |
| truthfulqa_gen | rouge1_acc | 0.341493 | 0.0166007 | 0.257038 | 0.0152981 | + |
| truthfulqa_gen | rouge1_diff | -2.29454 | 1.2086 | -10.1049 | 0.8922 | + |
| truthfulqa_gen | rouge2_max | 31.0617 | 1.08725 | 28.7438 | 0.981282 | + |
| truthfulqa_gen | rouge2_acc | 0.247246 | 0.0151024 | 0.201958 | 0.014054 | + |
| truthfulqa_gen | rouge2_diff | -2.84021 | 1.28749 | -11.0916 | 1.01664 | + |
| truthfulqa_gen | rougeL_max | 44.6463 | 0.966119 | 42.6116 | 0.893252 | + |
| truthfulqa_gen | rougeL_acc | 0.334149 | 0.0165125 | 0.24235 | 0.0150007 | + |
| truthfulqa_gen | rougeL_diff | -2.50853 | 1.22016 | -10.4299 | 0.904205 | + |
| hendrycksTest-public_relations | acc | 0.3 | 0.0438931 | 0.281818 | 0.0430912 |  |
| hendrycksTest-public_relations | acc_norm | 0.190909 | 0.0376443 | 0.163636 | 0.0354343 |  |
| blimp_distractor_agreement_relational_noun | acc | 0.859 | 0.0110109 | 0.833 | 0.0118004 | + |
| lambada_mt_fr | ppl | 57.0379 | 3.15719 | 51.7313 | 2.90272 | - |
| lambada_mt_fr | acc | 0.388512 | 0.0067906 | 0.40947 | 0.00685084 | - |
| blimp_principle_A_case_1 | acc | 1 | 0 | 1 | 0 |  |
| hendrycksTest-medical_genetics | acc | 0.37 | 0.0485237 | 0.31 | 0.0464823 | + |
| hendrycksTest-medical_genetics | acc_norm | 0.41 | 0.0494311 | 0.39 | 0.0490207 |  |
| qqp | acc | 0.364383 | 0.00239348 | 0.383626 | 0.00241841 | - |
| qqp | f1 | 0.516391 | 0.00263674 | 0.451222 | 0.00289696 | + |
| iwslt17-en-ar | bleu | 2.35563 | 0.188638 | 4.98225 | 0.275369 | - |
| iwslt17-en-ar | chrf | 0.140912 | 0.00503101 | 0.277708 | 0.00415432 | - |
| iwslt17-en-ar | ter | 1.0909 | 0.0122111 | 0.954701 | 0.0126737 | - |
| multirc | acc | 0.0409234 | 0.00642087 | 0.0178384 | 0.00428994 | + |
| hendrycksTest-human_aging | acc | 0.264574 | 0.0296051 | 0.264574 | 0.0296051 |  |
| hendrycksTest-human_aging | acc_norm | 0.197309 | 0.0267099 | 0.237668 | 0.0285681 | - |
| reversed_words | acc | 0.0003 | 0.000173188 | 0 | 0 | + |

(Some results are missing due to errors or computational constraints.)