Note that I have no association with any torrents, backups, or other ways of obtaining this model. If you do try them, please be safe. Here are the MD5 hashes (hex) for the pytorch_model.bin files:
pytorch_model.bin float32 : 833c1dc19b7450e4e559a9917b7d076a
pytorch_model.bin float16 : db3105866c9563b26f7399fafc00bb4b
And here are the SHA256 hashes:
pytorch_model.bin float32 : fc396bb082401c3c10daa1f0174d10782d95218181a8a6994f6112eb09d5a7e2
pytorch_model.bin float16 : 04195ed0feff23998edc1c486b7504a0c38c75c828018ed2743e0fa7ef4cb1df
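If you verify a download, the hashes can be checked with Python's standard library alone; a minimal sketch (the file path is a placeholder for wherever your copy lives):

import hashlib

def file_hash(path, algo="md5", chunk_size=1 << 20):
    # Hash a large file in chunks so it never has to fit in memory at once.
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare these against the MD5 and SHA256 values listed above.
print(file_hash("pytorch_model.bin", "md5"))
print(file_hash("pytorch_model.bin", "sha256"))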
GPT-4chan is a language model fine-tuned from GPT-J 6B on 3.5 years worth of data from 4chan's politically incorrect (/pol/) board.
GPT-4chan was fine-tuned on the dataset Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board.
The model was trained for 1 epoch following GPT-J's fine-tuning guide.
GPT-4chan is trained on anonymously posted and sparsely moderated discussions of political topics. Its intended use is to reproduce text according to the distribution of its input data. It may also be a useful tool to investigate discourse in such anonymous online communities. Lastly, it has potential applications in tasks such as toxicity detection, as initial experiments show promising zero-shot results when comparing a string's likelihood under GPT-4chan to its likelihood under GPT-J 6B.
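To illustrate that last point, here is a minimal sketch of such a zero-shot comparison. The scoring recipe (average per-token log-likelihood, difference between the two models) and the interpretation of the score are illustrative assumptions, not the exact setup of the initial experiments mentioned above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Both models share the GPT-J tokenizer; loading two 6B models needs a lot of memory.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
gpt4chan = AutoModelForCausalLM.from_pretrained("ykilcher/gpt-4chan")
gptj = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

def avg_log_likelihood(model, text):
    # Mean per-token log-likelihood of `text` under `model`.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over the tokens
    return -loss.item()

def toxicity_score(text):
    # Higher values: the text is relatively more likely under GPT-4chan than under GPT-J 6B.
    return avg_log_likelihood(gpt4chan, text) - avg_log_likelihood(gptj, text)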
The following is copied from the Hugging Face documentation on GPT-J. Refer to the original for more details.
For inference parameters, we recommend a temperature of 0.8, along with either a top_p of 0.8 or a typical_p of 0.3.
For the float32 model (CPU):
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the float32 weights (runs on CPU) and the standard GPT-J tokenizer.
model = AutoModelForCausalLM.from_pretrained("ykilcher/gpt-4chan")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample a continuation of the prompt.
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
For the float16 model (GPU):
import torch
from transformers import GPTJForCausalLM, AutoTokenizer

# Load the float16 weights and move the model to the GPU.
model = GPTJForCausalLM.from_pretrained(
    "ykilcher/gpt-4chan", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
model.cuda()
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.cuda()

# Sample a continuation of the prompt.
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
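The two snippets above are copied verbatim from the GPT-J documentation and therefore use top_p=0.9. To use the settings recommended earlier (temperature 0.8 with typical_p 0.3, or top_p 0.8), only the generate call changes; a sketch assuming model, tokenizer, and input_ids from either example:

# typical_p requires a reasonably recent version of transformers.
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    typical_p=0.3,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]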
This is a statistical model. As such, it continues text in whatever way is likely under the distribution it has learned from the training data. Outputs should not be interpreted as "correct", "truthful", or otherwise as anything more than a statistical function of the input. That said, GPT-4chan does significantly outperform GPT-J (and GPT-3) on the TruthfulQA benchmark, which measures whether a language model is truthful in generating answers to questions.
The dataset is time- and domain-limited: it was collected from 2016 to 2019 on 4chan's politically incorrect board. As such, political topics from that era will be overrepresented in the model's distribution compared to other models (e.g. GPT-J 6B). Also, due to the board's very lax rules and the anonymity of posters, a large part of the dataset contains offensive material. Thus, it is very likely that the model will produce offensive outputs, including but not limited to: toxicity, hate speech, racism, sexism, homo- and transphobia, xenophobia, and anti-semitism.
Because of these limitations, it is strongly recommended not to deploy this model in a real-world environment unless its behavior is well understood and explicit, strict limitations on the scope, impact, and duration of the deployment are enforced.
The following table compares GPT-J 6B to GPT-4chan on a subset of the Language Model Evaluation Harness. Differences exceeding the standard errors are marked in the "Significant" column: a minus sign (-) indicates an advantage for GPT-J 6B, and a plus sign (+) indicates an advantage for GPT-4chan.
Task | Metric | GPT-4chan | stderr | GPT-J-6B | stderr | Significant |
---|---|---|---|---|---|---|
copa | acc | 0.85 | 0.035887 | 0.83 | 0.0377525 | |
blimp_only_npi_scope | acc | 0.712 | 0.0143269 | 0.787 | 0.0129537 | - |
hendrycksTest-conceptual_physics | acc | 0.251064 | 0.028347 | 0.255319 | 0.0285049 | |
hendrycksTest-conceptual_physics | acc_norm | 0.187234 | 0.0255016 | 0.191489 | 0.0257221 | |
hendrycksTest-high_school_mathematics | acc | 0.248148 | 0.0263357 | 0.218519 | 0.0251958 | + |
hendrycksTest-high_school_mathematics | acc_norm | 0.3 | 0.0279405 | 0.251852 | 0.0264661 | + |
blimp_sentential_negation_npi_scope | acc | 0.734 | 0.01398 | 0.733 | 0.0139967 | |
hendrycksTest-high_school_european_history | acc | 0.278788 | 0.0350144 | 0.260606 | 0.0342774 | |
hendrycksTest-high_school_european_history | acc_norm | 0.315152 | 0.0362773 | 0.278788 | 0.0350144 | + |
blimp_wh_questions_object_gap | acc | 0.841 | 0.0115695 | 0.835 | 0.0117436 | |
hendrycksTest-international_law | acc | 0.214876 | 0.0374949 | 0.264463 | 0.0402619 | - |
hendrycksTest-international_law | acc_norm | 0.438017 | 0.0452915 | 0.404959 | 0.0448114 | |
hendrycksTest-high_school_us_history | acc | 0.323529 | 0.0328347 | 0.289216 | 0.0318223 | + |
hendrycksTest-high_school_us_history | acc_norm | 0.323529 | 0.0328347 | 0.29902 | 0.0321333 | |
openbookqa | acc | 0.276 | 0.0200112 | 0.29 | 0.0203132 | |
openbookqa | acc_norm | 0.362 | 0.0215137 | 0.382 | 0.0217508 | |
blimp_causative | acc | 0.737 | 0.0139293 | 0.761 | 0.013493 | - |
record | f1 | 0.878443 | 0.00322394 | 0.885049 | 0.00314367 | - |
record | em | 0.8702 | 0.003361 | 0.8765 | 0.00329027 | - |
blimp_determiner_noun_agreement_1 | acc | 0.996 | 0.00199699 | 0.995 | 0.00223159 | |
hendrycksTest-miscellaneous | acc | 0.305236 | 0.0164677 | 0.274585 | 0.0159598 | + |
hendrycksTest-miscellaneous | acc_norm | 0.269476 | 0.0158662 | 0.260536 | 0.015696 | |
hendrycksTest-virology | acc | 0.343373 | 0.0369658 | 0.349398 | 0.0371173 | |
hendrycksTest-virology | acc_norm | 0.331325 | 0.0366431 | 0.325301 | 0.0364717 | |
mathqa | acc | 0.269012 | 0.00811786 | 0.267002 | 0.00809858 | |
mathqa | acc_norm | 0.261642 | 0.00804614 | 0.270687 | 0.00813376 | - |
squad2 | exact | 10.6123 | 0 | 10.6207 | 0 | - |
squad2 | f1 | 17.8734 | 0 | 17.7413 | 0 | + |
squad2 | HasAns_exact | 17.2571 | 0 | 15.5027 | 0 | + |
squad2 | HasAns_f1 | 31.8 | 0 | 29.7643 | 0 | + |
squad2 | NoAns_exact | 3.98654 | 0 | 5.75273 | 0 | - |
squad2 | NoAns_f1 | 3.98654 | 0 | 5.75273 | 0 | - |
squad2 | best_exact | 50.0716 | 0 | 50.0716 | 0 | |
squad2 | best_f1 | 50.077 | 0 | 50.0778 | 0 | - |
mnli_mismatched | acc | 0.320586 | 0.00470696 | 0.376627 | 0.00488687 | - |
blimp_animate_subject_passive | acc | 0.79 | 0.0128867 | 0.781 | 0.0130847 | |
blimp_determiner_noun_agreement_with_adj_irregular_1 | acc | 0.834 | 0.0117721 | 0.878 | 0.0103549 | - |
qnli | acc | 0.491305 | 0.00676439 | 0.513454 | 0.00676296 | - |
blimp_intransitive | acc | 0.806 | 0.0125108 | 0.858 | 0.0110435 | - |
ethics_cm | acc | 0.512227 | 0.00802048 | 0.559846 | 0.00796521 | - |
hendrycksTest-high_school_computer_science | acc | 0.2 | 0.0402015 | 0.25 | 0.0435194 | - |
hendrycksTest-high_school_computer_science | acc_norm | 0.26 | 0.0440844 | 0.27 | 0.0446196 | |
iwslt17-ar-en | bleu | 21.4685 | 0.64825 | 20.7322 | 0.795602 | + |
iwslt17-ar-en | chrf | 0.452175 | 0.00498012 | 0.450919 | 0.00526515 | |
iwslt17-ar-en | ter | 0.733514 | 0.0201688 | 0.787631 | 0.0285488 | + |
hendrycksTest-security_studies | acc | 0.391837 | 0.0312513 | 0.363265 | 0.0307891 | |
hendrycksTest-security_studies | acc_norm | 0.285714 | 0.0289206 | 0.285714 | 0.0289206 | |
hendrycksTest-global_facts | acc | 0.29 | 0.0456048 | 0.25 | 0.0435194 | |
hendrycksTest-global_facts | acc_norm | 0.26 | 0.0440844 | 0.22 | 0.0416333 | |
anli_r1 | acc | 0.297 | 0.0144568 | 0.322 | 0.0147829 | - |
blimp_left_branch_island_simple_question | acc | 0.884 | 0.0101315 | 0.867 | 0.0107437 | + |
hendrycksTest-astronomy | acc | 0.25 | 0.0352381 | 0.25 | 0.0352381 | |
hendrycksTest-astronomy | acc_norm | 0.348684 | 0.0387814 | 0.335526 | 0.038425 | |
mrpc | acc | 0.536765 | 0.024717 | 0.683824 | 0.0230483 | - |
mrpc | f1 | 0.63301 | 0.0247985 | 0.812227 | 0.0162476 | - |
ethics_utilitarianism | acc | 0.525374 | 0.00720233 | 0.509775 | 0.00721024 | + |
blimp_determiner_noun_agreement_2 | acc | 0.99 | 0.003148 | 0.977 | 0.00474273 | + |
lambada_cloze | ppl | 388.123 | 13.1523 | 405.646 | 14.5519 | + |
lambada_cloze | acc | 0.0116437 | 0.00149456 | 0.0199884 | 0.00194992 | - |
truthfulqa_mc | mc1 | 0.225214 | 0.0146232 | 0.201958 | 0.014054 | + |
truthfulqa_mc | mc2 | 0.371625 | 0.0136558 | 0.359537 | 0.0134598 | |
blimp_wh_vs_that_with_gap_long_distance | acc | 0.441 | 0.0157088 | 0.342 | 0.0150087 | + |
hendrycksTest-business_ethics | acc | 0.28 | 0.0451261 | 0.29 | 0.0456048 | |
hendrycksTest-business_ethics | acc_norm | 0.29 | 0.0456048 | 0.3 | 0.0460566 | |
arithmetic_3ds | acc | 0.0065 | 0.00179736 | 0.046 | 0.0046854 | - |
blimp_determiner_noun_agreement_with_adjective_1 | acc | 0.988 | 0.00344498 | 0.978 | 0.00464086 | + |
hendrycksTest-moral_disputes | acc | 0.277457 | 0.0241057 | 0.283237 | 0.0242579 | |
hendrycksTest-moral_disputes | acc_norm | 0.309249 | 0.0248831 | 0.32659 | 0.0252483 | |
arithmetic_2da | acc | 0.0455 | 0.00466109 | 0.2405 | 0.00955906 | - |
qa4mre_2011 | acc | 0.425 | 0.0453163 | 0.458333 | 0.0456755 | |
qa4mre_2011 | acc_norm | 0.558333 | 0.0455219 | 0.533333 | 0.045733 | |
blimp_regular_plural_subject_verb_agreement_1 | acc | 0.966 | 0.00573384 | 0.968 | 0.00556839 | |
hendrycksTest-human_sexuality | acc | 0.389313 | 0.0427649 | 0.396947 | 0.0429114 | |
hendrycksTest-human_sexuality | acc_norm | 0.305344 | 0.0403931 | 0.343511 | 0.0416498 | |
blimp_passive_1 | acc | 0.878 | 0.0103549 | 0.885 | 0.0100934 | |
blimp_drop_argument | acc | 0.784 | 0.0130197 | 0.823 | 0.0120755 | - |
hendrycksTest-high_school_microeconomics | acc | 0.260504 | 0.0285103 | 0.277311 | 0.0290794 | |
hendrycksTest-high_school_microeconomics | acc_norm | 0.390756 | 0.0316938 | 0.39916 | 0.0318111 | |
hendrycksTest-us_foreign_policy | acc | 0.32 | 0.0468826 | 0.34 | 0.0476095 | |
hendrycksTest-us_foreign_policy | acc_norm | 0.4 | 0.0492366 | 0.35 | 0.0479372 | + |
blimp_ellipsis_n_bar_1 | acc | 0.846 | 0.0114199 | 0.841 | 0.0115695 | |
hendrycksTest-high_school_physics | acc | 0.264901 | 0.0360304 | 0.271523 | 0.0363133 | |
hendrycksTest-high_school_physics | acc_norm | 0.284768 | 0.0368488 | 0.271523 | 0.0363133 | |
qa4mre_2013 | acc | 0.362676 | 0.028579 | 0.401408 | 0.0291384 | - |
qa4mre_2013 | acc_norm | 0.387324 | 0.0289574 | 0.383803 | 0.0289082 | |
blimp_wh_vs_that_no_gap | acc | 0.963 | 0.00597216 | 0.969 | 0.00548353 | - |
headqa_es | acc | 0.238877 | 0.00814442 | 0.251276 | 0.0082848 | - |
headqa_es | acc_norm | 0.290664 | 0.00867295 | 0.286652 | 0.00863721 | |
blimp_sentential_subject_island | acc | 0.359 | 0.0151773 | 0.421 | 0.0156206 | - |
hendrycksTest-philosophy | acc | 0.241158 | 0.0242966 | 0.26045 | 0.0249267 | |
hendrycksTest-philosophy | acc_norm | 0.327974 | 0.0266644 | 0.334405 | 0.0267954 | |
hendrycksTest-elementary_mathematics | acc | 0.248677 | 0.0222618 | 0.251323 | 0.0223405 | |
hendrycksTest-elementary_mathematics | acc_norm | 0.275132 | 0.0230001 | 0.26455 | 0.0227175 | |
math_geometry | acc | 0.0187891 | 0.00621042 | 0.0104384 | 0.00464863 | + |
blimp_wh_questions_subject_gap_long_distance | acc | 0.886 | 0.0100551 | 0.883 | 0.0101693 | |
hendrycksTest-college_physics | acc | 0.205882 | 0.0402338 | 0.205882 | 0.0402338 | |
hendrycksTest-college_physics | acc_norm | 0.22549 | 0.0415831 | 0.245098 | 0.0428011 | |
hellaswag | acc | 0.488747 | 0.00498852 | 0.49532 | 0.00498956 | - |
hellaswag | acc_norm | 0.648277 | 0.00476532 | 0.66202 | 0.00472055 | - |
hendrycksTest-logical_fallacies | acc | 0.269939 | 0.0348783 | 0.294479 | 0.0358117 | |
hendrycksTest-logical_fallacies | acc_norm | 0.343558 | 0.0373113 | 0.355828 | 0.0376152 | |
hendrycksTest-machine_learning | acc | 0.339286 | 0.0449395 | 0.223214 | 0.039523 | + |
hendrycksTest-machine_learning | acc_norm | 0.205357 | 0.0383424 | 0.178571 | 0.0363521 | |
hendrycksTest-high_school_psychology | acc | 0.286239 | 0.0193794 | 0.273394 | 0.0191093 | |
hendrycksTest-high_school_psychology | acc_norm | 0.266055 | 0.018946 | 0.269725 | 0.0190285 | |
prost | acc | 0.256298 | 0.00318967 | 0.268254 | 0.00323688 | - |
prost | acc_norm | 0.280156 | 0.00328089 | 0.274658 | 0.00326093 | + |
blimp_determiner_noun_agreement_with_adj_irregular_2 | acc | 0.898 | 0.00957537 | 0.916 | 0.00877616 | - |
wnli | acc | 0.43662 | 0.0592794 | 0.464789 | 0.0596131 | |
hendrycksTest-professional_law | acc | 0.284876 | 0.0115278 | 0.273794 | 0.0113886 | |
hendrycksTest-professional_law | acc_norm | 0.301825 | 0.0117244 | 0.292699 | 0.0116209 | |
math_algebra | acc | 0.0126369 | 0.00324352 | 0.0117944 | 0.00313487 | |
wikitext | word_perplexity | 11.4687 | 0 | 10.8819 | 0 | - |
wikitext | byte_perplexity | 1.5781 | 0 | 1.56268 | 0 | - |
wikitext | bits_per_byte | 0.658188 | 0 | 0.644019 | 0 | - |
anagrams1 | acc | 0.0125 | 0.00111108 | 0.0008 | 0.000282744 | + |
math_prealgebra | acc | 0.0195178 | 0.00469003 | 0.0126292 | 0.00378589 | + |
blimp_principle_A_domain_2 | acc | 0.887 | 0.0100166 | 0.889 | 0.0099387 | |
cycle_letters | acc | 0.0331 | 0.00178907 | 0.0026 | 0.000509264 | + |
hendrycksTest-college_mathematics | acc | 0.26 | 0.0440844 | 0.26 | 0.0440844 | |
hendrycksTest-college_mathematics | acc_norm | 0.31 | 0.0464823 | 0.4 | 0.0492366 | - |
arithmetic_1dc | acc | 0.077 | 0.00596266 | 0.089 | 0.00636866 | - |
arithmetic_4da | acc | 0.0005 | 0.0005 | 0.007 | 0.00186474 | - |
triviaqa | acc | 0.150888 | 0.00336543 | 0.167418 | 0.00351031 | - |
boolq | acc | 0.673394 | 0.00820236 | 0.655352 | 0.00831224 | + |
random_insertion | acc | 0.0004 | 0.00019997 | 0 | 0 | + |
qa4mre_2012 | acc | 0.4 | 0.0388514 | 0.4125 | 0.0390407 | |
qa4mre_2012 | acc_norm | 0.4625 | 0.0395409 | 0.50625 | 0.0396495 | - |
math_asdiv | acc | 0.00997831 | 0.00207066 | 0.00563991 | 0.00156015 | + |
hendrycksTest-moral_scenarios | acc | 0.236872 | 0.0142196 | 0.236872 | 0.0142196 | |
hendrycksTest-moral_scenarios | acc_norm | 0.272626 | 0.0148934 | 0.272626 | 0.0148934 | |
hendrycksTest-high_school_geography | acc | 0.247475 | 0.0307463 | 0.20202 | 0.0286062 | + |
hendrycksTest-high_school_geography | acc_norm | 0.287879 | 0.0322588 | 0.292929 | 0.032425 | |
gsm8k | acc | 0 | 0 | 0 | 0 | |
blimp_existential_there_object_raising | acc | 0.812 | 0.0123616 | 0.792 | 0.0128414 | + |
blimp_superlative_quantifiers_2 | acc | 0.917 | 0.00872853 | 0.865 | 0.0108117 | + |
hendrycksTest-college_chemistry | acc | 0.28 | 0.0451261 | 0.24 | 0.0429235 | |
hendrycksTest-college_chemistry | acc_norm | 0.31 | 0.0464823 | 0.28 | 0.0451261 | |
blimp_existential_there_quantifiers_2 | acc | 0.545 | 0.0157551 | 0.383 | 0.0153801 | + |
hendrycksTest-abstract_algebra | acc | 0.17 | 0.0377525 | 0.26 | 0.0440844 | - |
hendrycksTest-abstract_algebra | acc_norm | 0.26 | 0.0440844 | 0.3 | 0.0460566 | |
hendrycksTest-professional_psychology | acc | 0.26634 | 0.0178832 | 0.28268 | 0.0182173 | |
hendrycksTest-professional_psychology | acc_norm | 0.256536 | 0.0176678 | 0.259804 | 0.0177409 | |
ethics_virtue | acc | 0.249849 | 0.00613847 | 0.200201 | 0.00567376 | + |
ethics_virtue | em | 0.0040201 | 0 | 0 | 0 | + |
arithmetic_5da | acc | 0 | 0 | 0.0005 | 0.0005 | - |
mutual | r@1 | 0.455982 | 0.0167421 | 0.468397 | 0.0167737 | |
mutual | r@2 | 0.732506 | 0.0148796 | 0.735892 | 0.0148193 | |
mutual | mrr | 0.675226 | 0.0103132 | 0.682186 | 0.0103375 | |
blimp_irregular_past_participle_verbs | acc | 0.869 | 0.0106749 | 0.876 | 0.0104275 | |
ethics_deontology | acc | 0.497775 | 0.00833904 | 0.523637 | 0.0083298 | - |
ethics_deontology | em | 0.00333704 | 0 | 0.0355951 | 0 | - |
blimp_transitive | acc | 0.818 | 0.0122076 | 0.855 | 0.01114 | - |
hendrycksTest-college_computer_science | acc | 0.29 | 0.0456048 | 0.27 | 0.0446196 | |
hendrycksTest-college_computer_science | acc_norm | 0.27 | 0.0446196 | 0.26 | 0.0440844 | |
hendrycksTest-professional_medicine | acc | 0.283088 | 0.0273659 | 0.272059 | 0.027033 | |
hendrycksTest-professional_medicine | acc_norm | 0.279412 | 0.0272572 | 0.261029 | 0.0266793 | |
sciq | acc | 0.895 | 0.00969892 | 0.915 | 0.00882343 | - |
sciq | acc_norm | 0.869 | 0.0106749 | 0.874 | 0.0104992 | |
blimp_anaphor_number_agreement | acc | 0.993 | 0.00263779 | 0.995 | 0.00223159 | |
blimp_wh_questions_subject_gap | acc | 0.925 | 0.00833333 | 0.913 | 0.00891687 | + |
blimp_wh_vs_that_with_gap | acc | 0.482 | 0.015809 | 0.429 | 0.015659 | + |
math_num_theory | acc | 0.0351852 | 0.00793611 | 0.0203704 | 0.00608466 | + |
blimp_complex_NP_island | acc | 0.538 | 0.0157735 | 0.535 | 0.0157805 | |
blimp_expletive_it_object_raising | acc | 0.777 | 0.0131698 | 0.78 | 0.0131062 | |
lambada_mt_en | ppl | 4.62504 | 0.10549 | 4.10224 | 0.0884971 | - |
lambada_mt_en | acc | 0.648554 | 0.00665142 | 0.682127 | 0.00648741 | - |
hendrycksTest-formal_logic | acc | 0.309524 | 0.0413491 | 0.34127 | 0.042408 | |
hendrycksTest-formal_logic | acc_norm | 0.325397 | 0.041906 | 0.325397 | 0.041906 | |
blimp_matrix_question_npi_licensor_present | acc | 0.663 | 0.0149551 | 0.727 | 0.014095 | - |
blimp_superlative_quantifiers_1 | acc | 0.791 | 0.0128641 | 0.871 | 0.0106053 | - |
lambada_mt_de | ppl | 89.7905 | 5.30301 | 82.2416 | 4.88447 | - |
lambada_mt_de | acc | 0.312245 | 0.0064562 | 0.312827 | 0.00645948 | |
hendrycksTest-computer_security | acc | 0.37 | 0.0485237 | 0.27 | 0.0446196 | + |
hendrycksTest-computer_security | acc_norm | 0.37 | 0.0485237 | 0.33 | 0.0472582 | |
ethics_justice | acc | 0.501479 | 0.00961712 | 0.526627 | 0.00960352 | - |
ethics_justice | em | 0 | 0 | 0.0251479 | 0 | - |
blimp_principle_A_reconstruction | acc | 0.296 | 0.0144427 | 0.444 | 0.0157198 | - |
blimp_existential_there_subject_raising | acc | 0.877 | 0.0103913 | 0.875 | 0.0104635 | |
math_precalc | acc | 0.014652 | 0.00514689 | 0.0018315 | 0.0018315 | + |
qasper | f1_yesno | 0.632997 | 0.032868 | 0.666667 | 0.0311266 | - |
qasper | f1_abstractive | 0.113489 | 0.00729073 | 0.118383 | 0.00692993 | |
cb | acc | 0.196429 | 0.0535714 | 0.357143 | 0.0646096 | - |
cb | f1 | 0.149038 | 0 | 0.288109 | 0 | - |
blimp_animate_subject_trans | acc | 0.858 | 0.0110435 | 0.868 | 0.0107094 | |
hendrycksTest-high_school_statistics | acc | 0.310185 | 0.031547 | 0.291667 | 0.0309987 | |
hendrycksTest-high_school_statistics | acc_norm | 0.361111 | 0.0327577 | 0.314815 | 0.0316747 | + |
blimp_irregular_plural_subject_verb_agreement_2 | acc | 0.881 | 0.0102442 | 0.919 | 0.00863212 | - |
lambada_mt_es | ppl | 92.1172 | 5.05064 | 83.6696 | 4.57489 | - |
lambada_mt_es | acc | 0.322337 | 0.00651139 | 0.326994 | 0.00653569 | |
anli_r2 | acc | 0.327 | 0.0148422 | 0.337 | 0.0149551 | |
hendrycksTest-nutrition | acc | 0.346405 | 0.0272456 | 0.346405 | 0.0272456 | |
hendrycksTest-nutrition | acc_norm | 0.385621 | 0.0278707 | 0.401961 | 0.0280742 | |
anli_r3 | acc | 0.336667 | 0.0136476 | 0.3525 | 0.0137972 | - |
blimp_regular_plural_subject_verb_agreement_2 | acc | 0.897 | 0.00961683 | 0.916 | 0.00877616 | - |
blimp_tough_vs_raising_2 | acc | 0.826 | 0.0119945 | 0.857 | 0.0110758 | - |
mnli | acc | 0.316047 | 0.00469317 | 0.374733 | 0.00488619 | - |
drop | em | 0.0595638 | 0.00242379 | 0.0228607 | 0.0015306 | + |
drop | f1 | 0.120355 | 0.00270951 | 0.103871 | 0.00219977 | + |
blimp_determiner_noun_agreement_with_adj_2 | acc | 0.95 | 0.00689547 | 0.936 | 0.00774364 | + |
arithmetic_2dm | acc | 0.061 | 0.00535293 | 0.14 | 0.00776081 | - |
blimp_determiner_noun_agreement_irregular_2 | acc | 0.93 | 0.00807249 | 0.932 | 0.00796489 | |
lambada | ppl | 4.62504 | 0.10549 | 4.10224 | 0.0884971 | - |
lambada | acc | 0.648554 | 0.00665142 | 0.682127 | 0.00648741 | - |
arithmetic_3da | acc | 0.007 | 0.00186474 | 0.0865 | 0.00628718 | - |
blimp_irregular_past_participle_adjectives | acc | 0.947 | 0.00708811 | 0.956 | 0.00648892 | - |
hendrycksTest-college_biology | acc | 0.201389 | 0.0335365 | 0.284722 | 0.0377381 | - |
hendrycksTest-college_biology | acc_norm | 0.222222 | 0.0347659 | 0.270833 | 0.0371618 | - |
headqa_en | acc | 0.324945 | 0.00894582 | 0.335522 | 0.00901875 | - |
headqa_en | acc_norm | 0.375638 | 0.00925014 | 0.383297 | 0.00928648 | |
blimp_determiner_noun_agreement_irregular_1 | acc | 0.912 | 0.00896305 | 0.944 | 0.0072744 | - |
blimp_existential_there_quantifiers_1 | acc | 0.985 | 0.00384575 | 0.981 | 0.00431945 | |
blimp_inchoative | acc | 0.653 | 0.0150605 | 0.683 | 0.0147217 | - |
mutual_plus | r@1 | 0.395034 | 0.0164328 | 0.409707 | 0.016531 | |
mutual_plus | r@2 | 0.674944 | 0.015745 | 0.680587 | 0.0156728 | |
mutual_plus | mrr | 0.632713 | 0.0103391 | 0.640801 | 0.0104141 | |
blimp_tough_vs_raising_1 | acc | 0.736 | 0.0139463 | 0.734 | 0.01398 | |
winogrande | acc | 0.636148 | 0.0135215 | 0.640884 | 0.0134831 | |
race | acc | 0.374163 | 0.0149765 | 0.37512 | 0.0149842 | |
blimp_irregular_plural_subject_verb_agreement_1 | acc | 0.908 | 0.00914438 | 0.918 | 0.00868052 | - |
hendrycksTest-high_school_macroeconomics | acc | 0.284615 | 0.0228783 | 0.284615 | 0.0228783 | |
hendrycksTest-high_school_macroeconomics | acc_norm | 0.284615 | 0.0228783 | 0.276923 | 0.022688 | |
blimp_adjunct_island | acc | 0.888 | 0.00997775 | 0.902 | 0.00940662 | - |
hendrycksTest-high_school_chemistry | acc | 0.236453 | 0.0298961 | 0.211823 | 0.028749 | |
hendrycksTest-high_school_chemistry | acc_norm | 0.300493 | 0.032258 | 0.29064 | 0.0319474 | |
arithmetic_2ds | acc | 0.051 | 0.00492053 | 0.218 | 0.00923475 | - |
blimp_principle_A_case_2 | acc | 0.955 | 0.00655881 | 0.953 | 0.00669596 | |
blimp_only_npi_licensor_present | acc | 0.926 | 0.00828206 | 0.953 | 0.00669596 | - |
math_counting_and_prob | acc | 0.0274262 | 0.00750954 | 0.0021097 | 0.0021097 | + |
cola | mcc | -0.0854256 | 0.0304519 | -0.0504508 | 0.0251594 | - |
webqs | acc | 0.023622 | 0.00336987 | 0.0226378 | 0.00330058 | |
arithmetic_4ds | acc | 0.0005 | 0.0005 | 0.0055 | 0.00165416 | - |
blimp_wh_vs_that_no_gap_long_distance | acc | 0.94 | 0.00751375 | 0.939 | 0.00757208 | |
pile_bookcorpus2 | word_perplexity | 28.7786 | 0 | 27.0559 | 0 | - |
pile_bookcorpus2 | byte_perplexity | 1.79969 | 0 | 1.78037 | 0 | - |
pile_bookcorpus2 | bits_per_byte | 0.847751 | 0 | 0.832176 | 0 | - |
blimp_sentential_negation_npi_licensor_present | acc | 0.994 | 0.00244335 | 0.982 | 0.00420639 | + |
hendrycksTest-high_school_government_and_politics | acc | 0.274611 | 0.0322102 | 0.227979 | 0.0302769 | + |
hendrycksTest-high_school_government_and_politics | acc_norm | 0.259067 | 0.0316188 | 0.248705 | 0.0311958 | |
blimp_ellipsis_n_bar_2 | acc | 0.937 | 0.00768701 | 0.916 | 0.00877616 | + |
hendrycksTest-clinical_knowledge | acc | 0.283019 | 0.0277242 | 0.267925 | 0.0272573 | |
hendrycksTest-clinical_knowledge | acc_norm | 0.343396 | 0.0292245 | 0.316981 | 0.0286372 | |
mc_taco | em | 0.125375 | 0 | 0.132883 | 0 | - |
mc_taco | f1 | 0.487131 | 0 | 0.499712 | 0 | - |
wsc | acc | 0.365385 | 0.0474473 | 0.365385 | 0.0474473 | |
hendrycksTest-college_medicine | acc | 0.231214 | 0.0321474 | 0.190751 | 0.0299579 | + |
hendrycksTest-college_medicine | acc_norm | 0.289017 | 0.0345643 | 0.265896 | 0.0336876 | |
hendrycksTest-high_school_world_history | acc | 0.295359 | 0.0296963 | 0.2827 | 0.0293128 | |
hendrycksTest-high_school_world_history | acc_norm | 0.312236 | 0.0301651 | 0.312236 | 0.0301651 | |
hendrycksTest-anatomy | acc | 0.296296 | 0.0394462 | 0.281481 | 0.03885 | |
hendrycksTest-anatomy | acc_norm | 0.288889 | 0.0391545 | 0.266667 | 0.0382017 | |
hendrycksTest-jurisprudence | acc | 0.25 | 0.0418609 | 0.277778 | 0.0433004 | |
hendrycksTest-jurisprudence | acc_norm | 0.416667 | 0.0476608 | 0.425926 | 0.0478034 | |
logiqa | acc | 0.193548 | 0.0154963 | 0.211982 | 0.016031 | - |
logiqa | acc_norm | 0.281106 | 0.0176324 | 0.291859 | 0.0178316 | |
ethics_utilitarianism_original | acc | 0.767679 | 0.00609112 | 0.941556 | 0.00338343 | - |
blimp_principle_A_c_command | acc | 0.827 | 0.0119672 | 0.81 | 0.0124119 | + |
blimp_coordinate_structure_constraint_complex_left_branch | acc | 0.794 | 0.0127956 | 0.764 | 0.0134345 | + |
arithmetic_5ds | acc | 0 | 0 | 0 | 0 | |
lambada_mt_it | ppl | 96.8846 | 5.80902 | 86.66 | 5.1869 | - |
lambada_mt_it | acc | 0.328158 | 0.00654165 | 0.336891 | 0.0065849 | - |
wsc273 | acc | 0.827839 | 0.0228905 | 0.827839 | 0.0228905 | |
blimp_coordinate_structure_constraint_object_extraction | acc | 0.852 | 0.0112349 | 0.876 | 0.0104275 | - |
blimp_principle_A_domain_3 | acc | 0.79 | 0.0128867 | 0.819 | 0.0121814 | - |
blimp_left_branch_island_echo_question | acc | 0.638 | 0.0152048 | 0.519 | 0.0158079 | + |
rte | acc | 0.534296 | 0.0300256 | 0.548736 | 0.0299531 | |
blimp_passive_2 | acc | 0.892 | 0.00982 | 0.899 | 0.00953362 | |
hendrycksTest-electrical_engineering | acc | 0.344828 | 0.0396093 | 0.358621 | 0.0399663 | |
hendrycksTest-electrical_engineering | acc_norm | 0.372414 | 0.0402873 | 0.372414 | 0.0402873 | |
sst | acc | 0.626147 | 0.0163938 | 0.493119 | 0.0169402 | + |
blimp_npi_present_1 | acc | 0.565 | 0.0156851 | 0.576 | 0.0156355 | |
piqa | acc | 0.739391 | 0.0102418 | 0.754081 | 0.0100473 | - |
piqa | acc_norm | 0.755169 | 0.0100323 | 0.761697 | 0.00994033 | |
hendrycksTest-professional_accounting | acc | 0.312057 | 0.0276401 | 0.265957 | 0.0263581 | + |
hendrycksTest-professional_accounting | acc_norm | 0.27305 | 0.0265779 | 0.22695 | 0.0249871 | + |
arc_challenge | acc | 0.325085 | 0.0136881 | 0.337884 | 0.013822 | |
arc_challenge | acc_norm | 0.352389 | 0.0139601 | 0.366041 | 0.0140772 | |
hendrycksTest-econometrics | acc | 0.263158 | 0.0414244 | 0.245614 | 0.0404934 | |
hendrycksTest-econometrics | acc_norm | 0.254386 | 0.0409699 | 0.27193 | 0.0418577 | |
headqa | acc | 0.238877 | 0.00814442 | 0.251276 | 0.0082848 | - |
headqa | acc_norm | 0.290664 | 0.00867295 | 0.286652 | 0.00863721 | |
wic | acc | 0.482759 | 0.0197989 | 0.5 | 0.0198107 | |
hendrycksTest-high_school_biology | acc | 0.270968 | 0.0252844 | 0.251613 | 0.024686 | |
hendrycksTest-high_school_biology | acc_norm | 0.274194 | 0.0253781 | 0.283871 | 0.0256494 | |
hendrycksTest-management | acc | 0.281553 | 0.0445325 | 0.23301 | 0.0418583 | + |
hendrycksTest-management | acc_norm | 0.291262 | 0.0449868 | 0.320388 | 0.0462028 | |
blimp_npi_present_2 | acc | 0.645 | 0.0151395 | 0.664 | 0.0149441 | - |
hendrycksTest-prehistory | acc | 0.265432 | 0.0245692 | 0.243827 | 0.0238919 | |
hendrycksTest-prehistory | acc_norm | 0.225309 | 0.0232462 | 0.219136 | 0.0230167 | |
hendrycksTest-world_religions | acc | 0.321637 | 0.0358253 | 0.333333 | 0.0361551 | |
hendrycksTest-world_religions | acc_norm | 0.397661 | 0.0375364 | 0.380117 | 0.0372297 | |
math_intermediate_algebra | acc | 0.00996678 | 0.00330749 | 0.00332226 | 0.00191598 | + |
anagrams2 | acc | 0.0347 | 0.00183028 | 0.0055 | 0.000739615 | + |
arc_easy | acc | 0.647306 | 0.00980442 | 0.669613 | 0.00965143 | - |
arc_easy | acc_norm | 0.609848 | 0.0100091 | 0.622896 | 0.00994504 | - |
blimp_anaphor_gender_agreement | acc | 0.993 | 0.00263779 | 0.994 | 0.00244335 | |
hendrycksTest-marketing | acc | 0.311966 | 0.0303515 | 0.307692 | 0.0302364 | |
hendrycksTest-marketing | acc_norm | 0.34188 | 0.031075 | 0.294872 | 0.0298726 | + |
blimp_principle_A_domain_1 | acc | 0.997 | 0.00173032 | 0.997 | 0.00173032 | |
blimp_wh_island | acc | 0.856 | 0.011108 | 0.852 | 0.0112349 | |
hendrycksTest-sociology | acc | 0.303483 | 0.0325101 | 0.278607 | 0.0317006 | |
hendrycksTest-sociology | acc_norm | 0.298507 | 0.0323574 | 0.318408 | 0.0329412 | |
blimp_distractor_agreement_relative_clause | acc | 0.774 | 0.0132325 | 0.719 | 0.0142212 | + |
truthfulqa_gen | bleurt_max | -0.811655 | 0.0180743 | -0.814228 | 0.0172128 | |
truthfulqa_gen | bleurt_acc | 0.395349 | 0.0171158 | 0.329253 | 0.0164513 | + |
truthfulqa_gen | bleurt_diff | -0.0488385 | 0.0204525 | -0.185905 | 0.0169617 | + |
truthfulqa_gen | bleu_max | 20.8747 | 0.717003 | 20.2238 | 0.711772 | |
truthfulqa_gen | bleu_acc | 0.330477 | 0.0164668 | 0.281518 | 0.015744 | + |
truthfulqa_gen | bleu_diff | -2.12856 | 0.832693 | -6.66121 | 0.719366 | + |
truthfulqa_gen | rouge1_max | 47.0293 | 0.962404 | 45.3457 | 0.89238 | + |
truthfulqa_gen | rouge1_acc | 0.341493 | 0.0166007 | 0.257038 | 0.0152981 | + |
truthfulqa_gen | rouge1_diff | -2.29454 | 1.2086 | -10.1049 | 0.8922 | + |
truthfulqa_gen | rouge2_max | 31.0617 | 1.08725 | 28.7438 | 0.981282 | + |
truthfulqa_gen | rouge2_acc | 0.247246 | 0.0151024 | 0.201958 | 0.014054 | + |
truthfulqa_gen | rouge2_diff | -2.84021 | 1.28749 | -11.0916 | 1.01664 | + |
truthfulqa_gen | rougeL_max | 44.6463 | 0.966119 | 42.6116 | 0.893252 | + |
truthfulqa_gen | rougeL_acc | 0.334149 | 0.0165125 | 0.24235 | 0.0150007 | + |
truthfulqa_gen | rougeL_diff | -2.50853 | 1.22016 | -10.4299 | 0.904205 | + |
hendrycksTest-public_relations | acc | 0.3 | 0.0438931 | 0.281818 | 0.0430912 | |
hendrycksTest-public_relations | acc_norm | 0.190909 | 0.0376443 | 0.163636 | 0.0354343 | |
blimp_distractor_agreement_relational_noun | acc | 0.859 | 0.0110109 | 0.833 | 0.0118004 | + |
lambada_mt_fr | ppl | 57.0379 | 3.15719 | 51.7313 | 2.90272 | - |
lambada_mt_fr | acc | 0.388512 | 0.0067906 | 0.40947 | 0.00685084 | - |
blimp_principle_A_case_1 | acc | 1 | 0 | 1 | 0 | |
hendrycksTest-medical_genetics | acc | 0.37 | 0.0485237 | 0.31 | 0.0464823 | + |
hendrycksTest-medical_genetics | acc_norm | 0.41 | 0.0494311 | 0.39 | 0.0490207 | |
qqp | acc | 0.364383 | 0.00239348 | 0.383626 | 0.00241841 | - |
qqp | f1 | 0.516391 | 0.00263674 | 0.451222 | 0.00289696 | + |
iwslt17-en-ar | bleu | 2.35563 | 0.188638 | 4.98225 | 0.275369 | - |
iwslt17-en-ar | chrf | 0.140912 | 0.00503101 | 0.277708 | 0.00415432 | - |
iwslt17-en-ar | ter | 1.0909 | 0.0122111 | 0.954701 | 0.0126737 | - |
multirc | acc | 0.0409234 | 0.00642087 | 0.0178384 | 0.00428994 | + |
hendrycksTest-human_aging | acc | 0.264574 | 0.0296051 | 0.264574 | 0.0296051 | |
hendrycksTest-human_aging | acc_norm | 0.197309 | 0.0267099 | 0.237668 | 0.0285681 | - |
reversed_words | acc | 0.0003 | 0.000173188 | 0 | 0 | + |
(Some results are missing due to errors or computational constraints.)
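The scores above come from EleutherAI's Language Model Evaluation Harness. As a rough sketch of how a single task can be re-run via the harness's Python API, treat the backend name, argument format, and task identifiers below as assumptions, since they differ between harness versions:

from lm_eval import evaluator

# Sketch only: the interface and task names vary across lm-evaluation-harness versions.
results = evaluator.simple_evaluate(
    model="gpt2",  # name of the Hugging Face causal-LM backend in older harness versions
    model_args="pretrained=ykilcher/gpt-4chan",
    tasks=["copa"],
)
print(results["results"])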