Note that I have no association with any torrents, backups, or other ways of obtaining this model. If you do try them, please be safe. Here are the MD5 hashes (hex) for the pytorch_model.bin files:
pytorch_model.bin float32 : 833c1dc19b7450e4e559a9917b7d076a
pytorch_model.bin float16 : db3105866c9563b26f7399fafc00bb4b
And here are the SHA256 hashes:
pytorch_model.bin float32 : fc396bb082401c3c10daa1f0174d10782d95218181a8a6994f6112eb09d5a7e2
pytorch_model.bin float16 : 04195ed0feff23998edc1c486b7504a0c38c75c828018ed2743e0fa7ef4cb1df
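If you verify a download, the hashes can be checked with Python's standard library alone; a minimal sketch (the file path is a placeholder for wherever your copy lives):

import hashlib

def file_hash(path, algo="md5", chunk_size=1 << 20):
    # Hash a large file in chunks so it never has to fit in memory at once.
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare these against the MD5 and SHA256 values listed above.
print(file_hash("pytorch_model.bin", "md5"))
print(file_hash("pytorch_model.bin", "sha256"))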
GPT-4chan is a language model fine-tuned from GPT-J 6B on 3.5 years worth of data from 4chan's politically incorrect (/pol/) board.
GPT-4chan was fine-tuned on the dataset Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board.
The model was trained for 1 epoch following GPT-J's fine-tuning guide.
GPT-4chan is trained on anonymously posted and sparsely moderated discussions of political topics. Its intended use is to reproduce text according to the distribution of its input data. It may also be a useful tool to investigate discourse in such anonymous online communities. Lastly, it has potential applications in tasks such as toxicity detection, as initial experiments show promising zero-shot results when comparing a string's likelihood under GPT-4chan to its likelihood under GPT-J 6B.
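To illustrate that last point, here is a minimal sketch of such a zero-shot comparison. The scoring recipe (average per-token log-likelihood, difference between the two models) and the interpretation of the score are illustrative assumptions, not the exact setup of the initial experiments mentioned above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Both models share the GPT-J tokenizer; loading two 6B models needs a lot of memory.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
gpt4chan = AutoModelForCausalLM.from_pretrained("ykilcher/gpt-4chan")
gptj = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

def avg_log_likelihood(model, text):
    # Mean per-token log-likelihood of `text` under `model`.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over the tokens
    return -loss.item()

def toxicity_score(text):
    # Higher values: the text is relatively more likely under GPT-4chan than under GPT-J 6B.
    return avg_log_likelihood(gpt4chan, text) - avg_log_likelihood(gptj, text)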
The following is copied from the Hugging Face documentation on GPT-J. Refer to the original for more details.
For inference parameters, we recommend a temperature of 0.8, along with either a top_p of 0.8 or a typical_p of 0.3.
For the float32 model (CPU):
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the float32 weights (runs on CPU) and the standard GPT-J tokenizer.
model = AutoModelForCausalLM.from_pretrained("ykilcher/gpt-4chan")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample a continuation of the prompt.
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
For the float16 model (GPU):
import torch
from transformers import GPTJForCausalLM, AutoTokenizer

# Load the float16 weights and move the model to the GPU.
model = GPTJForCausalLM.from_pretrained(
    "ykilcher/gpt-4chan", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
model.cuda()
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.cuda()

# Sample a continuation of the prompt.
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
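The two snippets above are copied verbatim from the GPT-J documentation and therefore use top_p=0.9. To use the settings recommended earlier (temperature 0.8 with typical_p 0.3, or top_p 0.8), only the generate call changes; a sketch assuming model, tokenizer, and input_ids from either example:

# typical_p requires a reasonably recent version of transformers.
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    typical_p=0.3,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]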
This is a statistical model. As such, it continues text in whatever way is likely under the distribution it has learned from the training data. Outputs should not be interpreted as "correct", "truthful", or otherwise as anything more than a statistical function of the input. That said, GPT-4chan does significantly outperform GPT-J (and GPT-3) on the TruthfulQA benchmark, which measures whether a language model is truthful in generating answers to questions.
The dataset is time- and domain-limited: it was collected from 2016 to 2019 on 4chan's politically incorrect board. As such, political topics from that era will be overrepresented in the model's distribution compared to other models (e.g. GPT-J 6B). Also, due to the board's very lax rules and the anonymity of posters, a large part of the dataset contains offensive material. Thus, it is very likely that the model will produce offensive outputs, including but not limited to: toxicity, hate speech, racism, sexism, homo- and transphobia, xenophobia, and anti-semitism.
Because of these limitations, it is strongly recommended not to deploy this model in a real-world environment unless its behavior is well understood and explicit, strict limitations on the scope, impact, and duration of the deployment are enforced.
The following table compares GPT-J 6B to GPT-4chan on a subset of the Language Model Evaluation Harness. Differences exceeding the standard errors are marked in the "Significant" column: a minus sign (-) indicates an advantage for GPT-J 6B, and a plus sign (+) indicates an advantage for GPT-4chan.
Task | Metric | GPT-4chan | stderr | GPT-J-6B | stderr | Significant |
---|---|---|---|---|---|---|
copa | acc | 0.85 | 0.035887 | 0.83 | 0.0377525 | |
blimp_only_npi_scope | acc | 0.712 | 0.0143269 | 0.787 | 0.0129537 | - |
hendrycksTest-conceptual_physics | acc | 0.251064 | 0.028347 | 0.255319 | 0.0285049 | |
hendrycksTest-conceptual_physics | acc_norm | 0.187234 | 0.0255016 | 0.191489 | 0.0257221 | |
hendrycksTest-high_school_mathematics | acc | 0.248148 | 0.0263357 | 0.218519 | 0.0251958 | + |
hendrycksTest-high_school_mathematics | acc_norm | 0.3 | 0.0279405 | 0.251852 | 0.0264661 | + |
blimp_sentential_negation_npi_scope | acc | 0.734 | 0.01398 | 0.733 | 0.0139967 | |
hendrycksTest-high_school_european_history | acc | 0.278788 | 0.0350144 | 0.260606 | 0.0342774 | |
hendrycksTest-high_school_european_history | acc_norm | 0.315152 | 0.0362773 | 0.278788 | 0.0350144 | + |
blimp_wh_questions_object_gap | acc | 0.841 | 0.0115695 | 0.835 | 0.0117436 | |
hendrycksTest-international_law | acc | 0.214876 | 0.0374949 | 0.264463 | 0.0402619 | - |
hendrycksTest-international_law | acc_norm | 0.438017 | 0.0452915 | 0.404959 | 0.0448114 | |
hendrycksTest-high_school_us_history | acc | 0.323529 | 0.0328347 | 0.289216 | 0.0318223 | + |
hendrycksTest-high_school_us_history | acc_norm | 0.323529 | 0.0328347 | 0.29902 | 0.0321333 | |
openbookqa | acc | 0.276 | 0.0200112 | 0.29 | 0.0203132 | |
openbookqa | acc_norm | 0.362 | 0.0215137 | 0.382 | 0.0217508 | |
blimp_causative | acc | 0.737 | 0.0139293 | 0.761 | 0.013493 | - |
record | f1 | 0.878443 | 0.00322394 | 0.885049 | 0.00314367 | - |
record | em | 0.8702 | 0.003361 | 0.8765 | 0.00329027 | - |
blimp_determiner_noun_agreement_1 | acc | 0.996 | 0.00199699 | 0.995 | 0.00223159 | |
hendrycksTest-miscellaneous | acc | 0.305236 | 0.0164677 | 0.274585 | 0.0159598 | + |
hendrycksTest-miscellaneous | acc_norm | 0.269476 | 0.0158662 | 0.260536 | 0.015696 | |
hendrycksTest-virology | acc | 0.343373 | 0.0369658 | 0.349398 | 0.0371173 | |
hendrycksTest-virology | acc_norm | 0.331325 | 0.0366431 | 0.325301 | 0.0364717 | |
mathqa | acc | 0.269012 | 0.00811786 | 0.267002 | 0.00809858 | |
mathqa | acc_norm | 0.261642 | 0.00804614 | 0.270687 | 0.00813376 | - |
squad2 | exact | 10.6123 | 0 | 10.6207 | 0 | - |
squad2 | f1 | 17.8734 | 0 | 17.7413 | 0 | + |
squad2 | HasAns_exact | 17.2571 | 0 | 15.5027 | 0 | + |
squad2 | HasAns_f1 | 31.8 | 0 | 29.7643 | 0 | + |
squad2 | NoAns_exact | 3.98654 | 0 | 5.75273 | 0 | - |
squad2 | NoAns_f1 | 3.98654 | 0 | 5.75273 | 0 | - |
squad2 | best_exact | 50.0716 | 0 | 50.0716 | 0 | |
squad2 | best_f1 | 50.077 | 0 | 50.0778 | 0 | - |
mnli_mismatched | acc | 0.320586 | 0.00470696 | 0.376627 | 0.00488687 | - |
blimp_animate_subject_passive | acc | 0.79 | 0.0128867 | 0.781 | 0.0130847 | |
blimp_determiner_noun_agreement_with_adj_irregular_1 | acc | 0.834 | 0.0117721 | 0.878 | 0.0103549 | - |
qnli | acc | 0.491305 | 0.00676439 | 0.513454 | 0.00676296 | - |
blimp_intransitive | acc | 0.806 | 0.0125108 | 0.858 | 0.0110435 | - |
ethics_cm | acc | 0.512227 | 0.00802048 | 0.559846 | 0.00796521 | - |
hendrycksTest-high_school_computer_science | acc | 0.2 | 0.0402015 | 0.25 | 0.0435194 | - |
hendrycksTest-high_school_computer_science | acc_norm | 0.26 | 0.0440844 | 0.27 | 0.0446196 | |
iwslt17-ar-en | bleu | 21.4685 | 0.64825 | 20.7322 | 0.795602 | + |
iwslt17-ar-en | chrf | 0.452175 | 0.00498012 | 0.450919 | 0.00526515 | |
iwslt17-ar-en | ter | 0.733514 | 0.0201688 | 0.787631 | 0.0285488 | + |
hendrycksTest-security_studies | acc | 0.391837 | 0.0312513 | 0.363265 | 0.0307891 | |
hendrycksTest-security_studies | acc_norm | 0.285714 | 0.0289206 | 0.285714 | 0.0289206 | |
hendrycksTest-global_facts | acc | 0.29 | 0.0456048 | 0.25 | 0.0435194 | |
hendrycksTest-global_facts | acc_norm | 0.26 | 0.0440844 | 0.22 | 0.0416333 | |
anli_r1 | acc | 0.297 | 0.0144568 | 0.322 | 0.0147829 | - |
blimp_left_branch_island_simple_question | acc | 0.884 | 0.0101315 | 0.867 | 0.0107437 | + |
hendrycksTest-astronomy | acc | 0.25 | 0.0352381 | 0.25 | 0.0352381 | |
hendrycksTest-astronomy | acc_norm | 0.348684 | 0.0387814 | 0.335526 | 0.038425 | |
mrpc | acc | 0.536765 | 0.024717 | 0.683824 | 0.0230483 | - |
mrpc | f1 | 0.63301 | 0.0247985 | 0.812227 | 0.0162476 | - |
ethics_utilitarianism | acc | 0.525374 | 0.00720233 | 0.509775 | 0.00721024 | + |
blimp_determiner_noun_agreement_2 | acc | 0.99 | 0.003148 | 0.977 | 0.00474273 | + |
lambada_cloze | ppl | 388.123 | 13.1523 | 405.646 | 14.5519 | + |
lambada_cloze | acc | 0.0116437 | 0.00149456 | 0.0199884 | 0.00194992 | - |
truthfulqa_mc | mc1 | 0.225214 | 0.0146232 | 0.201958 | 0.014054 | + |
truthfulqa_mc | mc2 | 0.371625 | 0.0136558 | 0.359537 | 0.0134598 | |
blimp_wh_vs_that_with_gap_long_distance | acc | 0.441 | 0.0157088 | 0.342 | 0.0150087 | + |
hendrycksTest-business_ethics | acc | 0.28 | 0.0451261 | 0.29 | 0.0456048 | |
hendrycksTest-business_ethics | acc_norm | 0.29 | 0.0456048 | 0.3 | 0.0460566 | |
arithmetic_3ds | acc | 0.0065 | 0.00179736 | 0.046 | 0.0046854 | - |
blimp_determiner_noun_agreement_with_adjective_1 | acc | 0.988 | 0.00344498 | 0.978 | 0.00464086 | + |
hendrycksTest-moral_disputes | acc | 0.277457 | 0.0241057 | 0.283237 | 0.0242579 | |
hendrycksTest-moral_disputes | acc_norm | 0.309249 | 0.0248831 | 0.32659 | 0.0252483 | |
arithmetic_2da | acc | 0.0455 | 0.00466109 | 0.2405 | 0.00955906 | - |
qa4mre_2011 | acc | 0.425 | 0.0453163 | 0.458333 | 0.0456755 | |
qa4mre_2011 | acc_norm | 0.558333 | 0.0455219 | 0.533333 | 0.045733 | |
blimp_regular_plural_subject_verb_agreement_1 | acc | 0.966 | 0.00573384 | 0.968 | 0.00556839 | |
hendrycksTest-human_sexuality | acc | 0.389313 | 0.0427649 | 0.396947 | 0.0429114 | |
hendrycksTest-human_sexuality | acc_norm | 0.305344 | 0.0403931 | 0.343511 | 0.0416498 | |
blimp_passive_1 | acc | 0.878 | 0.0103549 | 0.885 | 0.0100934 | |
blimp_drop_argument | acc | 0.784 | 0.0130197 | 0.823 | 0.0120755 | - |
hendrycksTest-high_school_microeconomics | acc | 0.260504 | 0.0285103 | 0.277311 | 0.0290794 | |
hendrycksTest-high_school_microeconomics | acc_norm | 0.390756 | 0.0316938 | 0.39916 | 0.0318111 | |
hendrycksTest-us_foreign_policy | acc | 0.32 | 0.0468826 | 0.34 | 0.0476095 | |
hendrycksTest-us_foreign_policy | acc_norm | 0.4 | 0.0492366 | 0.35 | 0.0479372 | + |
blimp_ellipsis_n_bar_1 | acc | 0.846 | 0.0114199 | 0.841 | 0.0115695 | |
hendrycksTest-high_school_physics | acc | 0.264901 | 0.0360304 | 0.271523 | 0.0363133 | |
hendrycksTest-high_school_physics | acc_norm | 0.284768 | 0.0368488 | 0.271523 | 0.0363133 | |
qa4mre_2013 | acc | 0.362676 | 0.028579 | 0.401408 | 0.0291384 | - |
qa4mre_2013 | acc_norm | 0.387324 | 0.0289574 | 0.383803 | 0.0289082 | |
blimp_wh_vs_that_no_gap | acc | 0.963 | 0.00597216 | 0.969 | 0.00548353 | - |
headqa_es | acc | 0.238877 | 0.00814442 | 0.251276 | 0.0082848 | - |
headqa_es | acc_norm | 0.290664 | 0.00867295 | 0.286652 | 0.00863721 | |
blimp_sentential_subject_island | acc | 0.359 | 0.0151773 | 0.421 | 0.0156206 | - |
hendrycksTest-philosophy | acc | 0.241158 | 0.0242966 | 0.26045 | 0.0249267 | |
hendrycksTest-philosophy | acc_norm | 0.327974 | 0.0266644 | 0.334405 | 0.0267954 | |
hendrycksTest-elementary_mathematics | acc | 0.248677 | 0.0222618 | 0.251323 | 0.0223405 | |
hendrycksTest-elementary_mathematics | acc_norm | 0.275132 | 0.0230001 | 0.26455 | 0.0227175 | |
math_geometry | acc | 0.0187891 | 0.00621042 | 0.0104384 | 0.00464863 | + |
blimp_wh_questions_subject_gap_long_distance | acc | 0.886 | 0.0100551 | 0.883 | 0.0101693 | |
hendrycksTest-college_physics | acc | 0.205882 | 0.0402338 | 0.205882 | 0.0402338 | |
hendrycksTest-college_physics | acc_norm | 0.22549 | 0.0415831 | 0.245098 | 0.0428011 | |
hellaswag | acc | 0.488747 | 0.00498852 | 0.49532 | 0.00498956 | - |
hellaswag | acc_norm | 0.648277 | 0.00476532 | 0.66202 | 0.00472055 | - |
hendrycksTest-logical_fallacies | acc | 0.269939 | 0.0348783 | 0.294479 | 0.0358117 | |
hendrycksTest-logical_fallacies | acc_norm | 0.343558 | 0.0373113 | 0.355828 | 0.0376152 | |
hendrycksTest-machine_learning | acc | 0.339286 | 0.0449395 | 0.223214 | 0.039523 | + |
hendrycksTest-machine_learning | acc_norm | 0.205357 | 0.0383424 | 0.178571 | 0.0363521 | |
hendrycksTest-high_school_psychology | acc | 0.286239 | 0.0193794 | 0.273394 | 0.0191093 | |
hendrycksTest-high_school_psychology | acc_norm | 0.266055 | 0.018946 | 0.269725 | 0.0190285 | |
prost | acc | 0.256298 | 0.00318967 | 0.268254 | 0.00323688 | - |
prost | acc_norm | 0.280156 | 0.00328089 | 0.274658 | 0.00326093 | + |
blimp_determiner_noun_agreement_with_adj_irregular_2 | acc | 0.898 | 0.00957537 | 0.916 | 0.00877616 | - |
wnli | acc | 0.43662 | 0.0592794 | 0.464789 | 0.0596131 | |
hendrycksTest-professional_law | acc | 0.284876 | 0.0115278 | 0.273794 | 0.0113886 | |
hendrycksTest-professional_law | acc_norm | 0.301825 | 0.0117244 | 0.292699 | 0.0116209 | |
math_algebra | acc | 0.0126369 | 0.00324352 | 0.0117944 | 0.00313487 | |
wikitext | word_perplexity | 11.4687 | 0 | 10.8819 | 0 | - |
wikitext | byte_perplexity | 1.5781 | 0 | 1.56268 | 0 | - |
wikitext | bits_per_byte | 0.658188 | 0 | 0.644019 | 0 | - |
anagrams1 | acc | 0.0125 | 0.00111108 | 0.0008 | 0.000282744 | + |
math_prealgebra | acc | 0.0195178 | 0.00469003 | 0.0126292 | 0.00378589 | + |
blimp_principle_A_domain_2 | acc | 0.887 | 0.0100166 | 0.889 | 0.0099387 | |
cycle_letters | acc | 0.0331 | 0.00178907 | 0.0026 | 0.000509264 | + |
hendrycksTest-college_mathematics | acc | 0.26 | 0.0440844 | 0.26 | 0.0440844 | |
hendrycksTest-college_mathematics | acc_norm | 0.31 | 0.0464823 | 0.4 | 0.0492366 | - |
arithmetic_1dc | acc | 0.077 | 0.00596266 | 0.089 | 0.00636866 | - |
arithmetic_4da | acc | 0.0005 | 0.0005 | 0.007 | 0.00186474 | - |
triviaqa | acc | 0.150888 | 0.00336543 | 0.167418 | 0.00351031 | - |
boolq | acc | 0.673394 | 0.00820236 | 0.655352 | 0.00831224 | + |
random_insertion | acc | 0.0004 | 0.00019997 | 0 | 0 | + |
qa4mre_2012 | acc | 0.4 | 0.0388514 | 0.4125 | 0.0390407 | |
qa4mre_2012 | acc_norm | 0.4625 | 0.0395409 | 0.50625 | 0.0396495 | - |
math_asdiv | acc | 0.00997831 | 0.00207066 | 0.00563991 | 0.00156015 | + |
hendrycksTest-moral_scenarios | acc | 0.236872 | 0.0142196 | 0.236872 | 0.0142196 | |
hendrycksTest-moral_scenarios | acc_norm | 0.272626 | 0.0148934 | 0.272626 | 0.0148934 | |
hendrycksTest-high_school_geography | acc | 0.247475 | 0.0307463 | 0.20202 | 0.0286062 | + |
hendrycksTest-high_school_geography | acc_norm | 0.287879 | 0.0322588 | 0.292929 | 0.032425 | |
gsm8k | acc | 0 | 0 | 0 | 0 | |
blimp_existential_there_object_raising | acc | 0.812 | 0.0123616 | 0.792 | 0.0128414 | + |
blimp_superlative_quantifiers_2 | acc | 0.917 | 0.00872853 | 0.865 | 0.0108117 | + |
hendrycksTest-college_chemistry | acc | 0.28 | 0.0451261 | 0.24 | 0.0429235 | |
hendrycksTest-college_chemistry | acc_norm | 0.31 | 0.0464823 | 0.28 | 0.0451261 | |
blimp_existential_there_quantifiers_2 | acc | 0.545 | 0.0157551 | 0.383 | 0.0153801 | + |
hendrycksTest-abstract_algebra | acc | 0.17 | 0.0377525 | 0.26 | 0.0440844 | - |
hendrycksTest-abstract_algebra | acc_norm | 0.26 | 0.0440844 | 0.3 | 0.0460566 | |
hendrycksTest-professional_psychology | acc | 0.26634 | 0.0178832 | 0.28268 | 0.0182173 | |
hendrycksTest-professional_psychology | acc_norm | 0.256536 | 0.0176678 | 0.259804 | 0.0177409 | |
ethics_virtue | acc | 0.249849 | 0.00613847 | 0.200201 | 0.00567376 | + |
ethics_virtue | em | 0.0040201 | 0 | 0 | 0 | + |
arithmetic_5da | acc | 0 | 0 | 0.0005 | 0.0005 | - |
mutual | r@1 | 0.455982 | 0.0167421 | 0.468397 | 0.0167737 | |
mutual | r@2 | 0.732506 | 0.0148796 | 0.735892 | 0.0148193 | |
mutual | mrr | 0.675226 | 0.0103132 | 0.682186 | 0.0103375 | |
blimp_irregular_past_participle_verbs | acc | 0.869 | 0.0106749 | 0.876 | 0.0104275 | |
ethics_deontology | acc | 0.497775 | 0.00833904 | 0.523637 | 0.0083298 | - |
ethics_deontology | em | 0.00333704 | 0 | 0.0355951 | 0 | - |
blimp_transitive | acc | 0.818 | 0.0122076 | 0.855 | 0.01114 | - |
hendrycksTest-college_computer_science | acc | 0.29 | 0.0456048 | 0.27 | 0.0446196 | |
hendrycksTest-college_computer_science | acc_norm | 0.27 | 0.0446196 | 0.26 | 0.0440844 | |
hendrycksTest-professional_medicine | acc | 0.283088 | 0.0273659 | 0.272059 | 0.027033 | |
hendrycksTest-professional_medicine | acc_norm | 0.279412 | 0.0272572 | 0.261029 | 0.0266793 | |
sciq | acc | 0.895 | 0.00969892 | 0.915 | 0.00882343 | - |
sciq | acc_norm | 0.869 | 0.0106749 | 0.874 | 0.0104992 | |
blimp_anaphor_number_agreement | acc | 0.993 | 0.00263779 | 0.995 | 0.00223159 | |
blimp_wh_questions_subject_gap | acc | 0.925 | 0.00833333 | 0.913 | 0.00891687 | + |
blimp_wh_vs_that_with_gap | acc | 0.482 | 0.015809 | 0.429 | 0.015659 | + |
math_num_theory | acc | 0.0351852 | 0.00793611 | 0.0203704 | 0.00608466 | + |
blimp_complex_NP_island | acc | 0.538 | 0.0157735 | 0.535 | 0.0157805 | |
blimp_expletive_it_object_raising | acc | 0.777 | 0.0131698 | 0.78 | 0.0131062 | |
lambada_mt_en | ppl | 4.62504 | 0.10549 | 4.10224 | 0.0884971 | - |
lambada_mt_en | acc | 0.648554 | 0.00665142 | 0.682127 | 0.00648741 | - |
hendrycksTest-formal_logic | acc | 0.309524 | 0.0413491 | 0.34127 | 0.042408 | |
hendrycksTest-formal_logic | acc_norm | 0.325397 | 0.041906 | 0.325397 | 0.041906 | |
blimp_matrix_question_npi_licensor_present | acc | 0.663 | 0.0149551 | 0.727 | 0.014095 | - |
blimp_superlative_quantifiers_1 | acc | 0.791 | 0.0128641 | 0.871 | 0.0106053 | - |
lambada_mt_de | ppl | 89.7905 | 5.30301 | 82.2416 | 4.88447 | - |
lambada_mt_de | acc | 0.312245 | 0.0064562 | 0.312827 | 0.00645948 | |
hendrycksTest-computer_security | acc | 0.37 | 0.0485237 | 0.27 | 0.0446196 | + |
hendrycksTest-computer_security | acc_norm | 0.37 | 0.0485237 | 0.33 | 0.0472582 | |
ethics_justice | acc | 0.501479 | 0.00961712 | 0.526627 | 0.00960352 | - |
ethics_justice | em | 0 | 0 | 0.0251479 | 0 | - |
blimp_principle_A_reconstruction | acc | 0.296 | 0.0144427 | 0.444 | 0.0157198 | - |
blimp_existential_there_subject_raising | acc | 0.877 | 0.0103913 | 0.875 | 0.0104635 | |
math_precalc | acc | 0.014652 | 0.00514689 | 0.0018315 | 0.0018315 | + |
qasper | f1_yesno | 0.632997 | 0.032868 | 0.666667 | 0.0311266 | - |
qasper | f1_abstractive | 0.113489 | 0.00729073 | 0.118383 | 0.00692993 | |
cb | acc | 0.196429 | 0.0535714 | 0.357143 | 0.0646096 | - |
cb | f1 | 0.149038 | 0 | 0.288109 | 0 | - |
blimp_animate_subject_trans | acc | 0.858 | 0.0110435 | 0.868 | 0.0107094 | |
hendrycksTest-high_school_statistics | acc | 0.310185 | 0.031547 | 0.291667 | 0.0309987 | |
hendrycksTest-high_school_statistics | acc_norm | 0.361111 | 0.0327577 | 0.314815 | 0.0316747 | + |
blimp_irregular_plural_subject_verb_agreement_2 | acc | 0.881 | 0.0102442 | 0.919 | 0.00863212 | - |
lambada_mt_es | ppl | 92.1172 | 5.05064 | 83.6696 | 4.57489 | - |
lambada_mt_es | acc | 0.322337 | 0.00651139 | 0.326994 | 0.00653569 | |
anli_r2 | acc | 0.327 | 0.0148422 | 0.337 | 0.0149551 | |
hendrycksTest-nutrition | acc | 0.346405 | 0.0272456 | 0.346405 | 0.0272456 | |
hendrycksTest-nutrition | acc_norm | 0.385621 | 0.0278707 | 0.401961 | 0.0280742 | |
anli_r3 | acc | 0.336667 | 0.0136476 | 0.3525 | 0.0137972 | - |
blimp_regular_plural_subject_verb_agreement_2 | acc | 0.897 | 0.00961683 | 0.916 | 0.00877616 | - |
blimp_tough_vs_raising_2 | acc | 0.826 | 0.0119945 | 0.857 | 0.0110758 | - |
mnli | acc | 0.316047 | 0.00469317 | 0.374733 | 0.00488619 | - |
drop | em | 0.0595638 | 0.00242379 | 0.0228607 | 0.0015306 | + |
drop | f1 | 0.120355 | 0.00270951 | 0.103871 | 0.00219977 | + |
blimp_determiner_noun_agreement_with_adj_2 | acc | 0.95 | 0.00689547 | 0.936 | 0.00774364 | + |
arithmetic_2dm | acc | 0.061 | 0.00535293 | 0.14 | 0.00776081 | - |
blimp_determiner_noun_agreement_irregular_2 | acc | 0.93 | 0.00807249 | 0.932 | 0.00796489 | |
lambada | ppl | 4.62504 | 0.10549 | 4.10224 | 0.0884971 | - |
lambada | acc | 0.648554 | 0.00665142 | 0.682127 | 0.00648741 | - |
arithmetic_3da | acc | 0.007 | 0.00186474 | 0.0865 | 0.00628718 | - |
blimp_irregular_past_participle_adjectives | acc | 0.947 | 0.00708811 | 0.956 | 0.00648892 | - |
hendrycksTest-college_biology | acc | 0.201389 | 0.0335365 | 0.284722 | 0.0377381 | - |
hendrycksTest-college_biology | acc_norm | 0.222222 | 0.0347659 | 0.270833 | 0.0371618 | - |
headqa_en | acc | 0.324945 | 0.00894582 | 0.335522 | 0.00901875 | - |
headqa_en | acc_norm | 0.375638 | 0.00925014 | 0.383297 | 0.00928648 | |
blimp_determiner_noun_agreement_irregular_1 | acc | 0.912 | 0.00896305 | 0.944 | 0.0072744 | - |
blimp_existential_there_quantifiers_1 | acc | 0.985 | 0.00384575 | 0.981 | 0.00431945 | |
blimp_inchoative | acc | 0.653 | 0.0150605 | 0.683 | 0.0147217 | - |
mutual_plus | r@1 | 0.395034 | 0.0164328 | 0.409707 | 0.016531 | |
mutual_plus | r@2 | 0.674944 | 0.015745 | 0.680587 | 0.0156728 | |
mutual_plus | mrr | 0.632713 | 0.0103391 | 0.640801 | 0.0104141 | |
blimp_tough_vs_raising_1 | acc | 0.736 | 0.0139463 | 0.734 | 0.01398 | |
winogrande | acc | 0.636148 | 0.0135215 | 0.640884 | 0.0134831 | |
race | acc | 0.374163 | 0.0149765 | 0.37512 | 0.0149842 | |
blimp_irregular_plural_subject_verb_agreement_1 | acc | 0.908 | 0.00914438 | 0.918 | 0.00868052 | - |
hendrycksTest-high_school_macroeconomics | acc | 0.284615 | 0.0228783 | 0.284615 | 0.0228783 | |
hendrycksTest-high_school_macroeconomics | acc_norm | 0.284615 | 0.0228783 | 0.276923 | 0.022688 | |
blimp_adjunct_island | acc | 0.888 | 0.00997775 | 0.902 | 0.00940662 | - |
hendrycksTest-high_school_chemistry | acc | 0.236453 | 0.0298961 | 0.211823 | 0.028749 | |
hendrycksTest-high_school_chemistry | acc_norm | 0.300493 | 0.032258 | 0.29064 | 0.0319474 | |
arithmetic_2ds | acc | 0.051 | 0.00492053 | 0.218 | 0.00923475 | - |
blimp_principle_A_case_2 | acc | 0.955 | 0.00655881 | 0.953 | 0.00669596 | |
blimp_only_npi_licensor_present | acc | 0.926 | 0.00828206 | 0.953 | 0.00669596 | - |
math_counting_and_prob | acc | 0.0274262 | 0.00750954 | 0.0021097 | 0.0021097 | + |
cola | mcc | -0.0854256 | 0.0304519 | -0.0504508 | 0.0251594 | - |
webqs | acc | 0.023622 | 0.00336987 | 0.0226378 | 0.00330058 | |
arithmetic_4ds | acc | 0.0005 | 0.0005 | 0.0055 | 0.00165416 | - |
blimp_wh_vs_that_no_gap_long_distance | acc | 0.94 | 0.00751375 | 0.939 | 0.00757208 | |
pile_bookcorpus2 | word_perplexity | 28.7786 | 0 | 27.0559 | 0 | - |
pile_bookcorpus2 | byte_perplexity | 1.79969 | 0 | 1.78037 | 0 | - |
pile_bookcorpus2 | bits_per_byte | 0.847751 | 0 | 0.832176 | 0 | - |
blimp_sentential_negation_npi_licensor_present | acc | 0.994 | 0.00244335 | 0.982 | 0.00420639 | + |
hendrycksTest-high_school_government_and_politics | acc | 0.274611 | 0.0322102 | 0.227979 | 0.0302769 | + |
hendrycksTest-high_school_government_and_politics | acc_norm | 0.259067 | 0.0316188 | 0.248705 | 0.0311958 | |
blimp_ellipsis_n_bar_2 | acc | 0.937 | 0.00768701 | 0.916 | 0.00877616 | + |
hendrycksTest-clinical_knowledge | acc | 0.283019 | 0.0277242 | 0.267925 | 0.0272573 | |
hendrycksTest-clinical_knowledge | acc_norm | 0.343396 | 0.0292245 | 0.316981 | 0.0286372 | |
mc_taco | em | 0.125375 | 0 | 0.132883 | 0 | - |
mc_taco | f1 | 0.487131 | 0 | 0.499712 | 0 | - |
wsc | acc | 0.365385 | 0.0474473 | 0.365385 | 0.0474473 | |
hendrycksTest-college_medicine | acc | 0.231214 | 0.0321474 | 0.190751 | 0.0299579 | + |
hendrycksTest-college_medicine | acc_norm | 0.289017 | 0.0345643 | 0.265896 | 0.0336876 | |
hendrycksTest-high_school_world_history | acc | 0.295359 | 0.0296963 | 0.2827 | 0.0293128 | |
hendrycksTest-high_school_world_history | acc_norm | 0.312236 | 0.0301651 | 0.312236 | 0.0301651 | |
hendrycksTest-anatomy | acc | 0.296296 | 0.0394462 | 0.281481 | 0.03885 | |
hendrycksTest-anatomy | acc_norm | 0.288889 | 0.0391545 | 0.266667 | 0.0382017 | |
hendrycksTest-jurisprudence | acc | 0.25 | 0.0418609 | 0.277778 | 0.0433004 | |
hendrycksTest-jurisprudence | acc_norm | 0.416667 | 0.0476608 | 0.425926 | 0.0478034 | |
logiqa | acc | 0.193548 | 0.0154963 | 0.211982 | 0.016031 | - |
logiqa | acc_norm | 0.281106 | 0.0176324 | 0.291859 | 0.0178316 | |
ethics_utilitarianism_original | acc | 0.767679 | 0.00609112 | 0.941556 | 0.00338343 | - |
blimp_principle_A_c_command | acc | 0.827 | 0.0119672 | 0.81 | 0.0124119 | + |
blimp_coordinate_structure_constraint_complex_left_branch | acc | 0.794 | 0.0127956 | 0.764 | 0.0134345 | + |
arithmetic_5ds | acc | 0 | 0 | 0 | 0 | |
lambada_mt_it | ppl | 96.8846 | 5.80902 | 86.66 | 5.1869 | - |
lambada_mt_it | acc | 0.328158 | 0.00654165 | 0.336891 | 0.0065849 | - |
wsc273 | acc | 0.827839 | 0.0228905 | 0.827839 | 0.0228905 | |
blimp_coordinate_structure_constraint_object_extraction | acc | 0.852 | 0.0112349 | 0.876 | 0.0104275 | - |
blimp_principle_A_domain_3 | acc | 0.79 | 0.0128867 | 0.819 | 0.0121814 | - |
blimp_left_branch_island_echo_question | acc | 0.638 | 0.0152048 | 0.519 | 0.0158079 | + |
rte | acc | 0.534296 | 0.0300256 | 0.548736 | 0.0299531 | |
blimp_passive_2 | acc | 0.892 | 0.00982 | 0.899 | 0.00953362 | |
hendrycksTest-electrical_engineering | acc | 0.344828 | 0.0396093 | 0.358621 | 0.0399663 | |
hendrycksTest-electrical_engineering | acc_norm | 0.372414 | 0.0402873 | 0.372414 | 0.0402873 | |
sst | acc | 0.626147 | 0.0163938 | 0.493119 | 0.0169402 | + |
blimp_npi_present_1 | acc | 0.565 | 0.0156851 | 0.576 | 0.0156355 | |
piqa | acc | 0.739391 | 0.0102418 | 0.754081 | 0.0100473 | - |
piqa | acc_norm | 0.755169 | 0.0100323 | 0.761697 | 0.00994033 | |
hendrycksTest-professional_accounting | acc | 0.312057 | 0.0276401 | 0.265957 | 0.0263581 | + |
hendrycksTest-professional_accounting | acc_norm | 0.27305 | 0.0265779 | 0.22695 | 0.0249871 | + |
arc_challenge | acc | 0.325085 | 0.0136881 | 0.337884 | 0.013822 | |
arc_challenge | acc_norm | 0.352389 | 0.0139601 | 0.366041 | 0.0140772 | |
hendrycksTest-econometrics | acc | 0.263158 | 0.0414244 | 0.245614 | 0.0404934 | |
hendrycksTest-econometrics | acc_norm | 0.254386 | 0.0409699 | 0.27193 | 0.0418577 | |
headqa | acc | 0.238877 | 0.00814442 | 0.251276 | 0.0082848 | - |
headqa | acc_norm | 0.290664 | 0.00867295 | 0.286652 | 0.00863721 | |
wic | acc | 0.482759 | 0.0197989 | 0.5 | 0.0198107 | |
hendrycksTest-high_school_biology | acc | 0.270968 | 0.0252844 | 0.251613 | 0.024686 | |
hendrycksTest-high_school_biology | acc_norm | 0.274194 | 0.0253781 | 0.283871 | 0.0256494 | |
hendrycksTest-management | acc | 0.281553 | 0.0445325 | 0.23301 | 0.0418583 | + |
hendrycksTest-management | acc_norm | 0.291262 | 0.0449868 | 0.320388 | 0.0462028 | |
blimp_npi_present_2 | acc | 0.645 | 0.0151395 | 0.664 | 0.0149441 | - |
hendrycksTest-prehistory | acc | 0.265432 | 0.0245692 | 0.243827 | 0.0238919 | |
hendrycksTest-prehistory | acc_norm | 0.225309 | 0.0232462 | 0.219136 | 0.0230167 | |
hendrycksTest-world_religions | acc | 0.321637 | 0.0358253 | 0.333333 | 0.0361551 | |
hendrycksTest-world_religions | acc_norm | 0.397661 | 0.0375364 | 0.380117 | 0.0372297 | |
math_intermediate_algebra | acc | 0.00996678 | 0.00330749 | 0.00332226 | 0.00191598 | + |
anagrams2 | acc | 0.0347 | 0.00183028 | 0.0055 | 0.000739615 | + |
arc_easy | acc | 0.647306 | 0.00980442 | 0.669613 | 0.00965143 | - |
arc_easy | acc_norm | 0.609848 | 0.0100091 | 0.622896 | 0.00994504 | - |
blimp_anaphor_gender_agreement | acc | 0.993 | 0.00263779 | 0.994 | 0.00244335 | |
hendrycksTest-marketing | acc | 0.311966 | 0.0303515 | 0.307692 | 0.0302364 | |
hendrycksTest-marketing | acc_norm | 0.34188 | 0.031075 | 0.294872 | 0.0298726 | + |
blimp_principle_A_domain_1 | acc | 0.997 | 0.00173032 | 0.997 | 0.00173032 | |
blimp_wh_island | acc | 0.856 | 0.011108 | 0.852 | 0.0112349 | |
hendrycksTest-sociology | acc | 0.303483 | 0.0325101 | 0.278607 | 0.0317006 | |
hendrycksTest-sociology | acc_norm | 0.298507 | 0.0323574 | 0.318408 | 0.0329412 | |
blimp_distractor_agreement_relative_clause | acc | 0.774 | 0.0132325 | 0.719 | 0.0142212 | + |
truthfulqa_gen | bleurt_max | -0.811655 | 0.0180743 | -0.814228 | 0.0172128 | |
truthfulqa_gen | bleurt_acc | 0.395349 | 0.0171158 | 0.329253 | 0.0164513 | + |
truthfulqa_gen | bleurt_diff | -0.0488385 | 0.0204525 | -0.185905 | 0.0169617 | + |
truthfulqa_gen | bleu_max | 20.8747 | 0.717003 | 20.2238 | 0.711772 | |
truthfulqa_gen | bleu_acc | 0.330477 | 0.0164668 | 0.281518 | 0.015744 | + |
truthfulqa_gen | bleu_diff | -2.12856 | 0.832693 | -6.66121 | 0.719366 | + |
truthfulqa_gen | rouge1_max | 47.0293 | 0.962404 | 45.3457 | 0.89238 | + |
truthfulqa_gen | rouge1_acc | 0.341493 | 0.0166007 | 0.257038 | 0.0152981 | + |
truthfulqa_gen | rouge1_diff | -2.29454 | 1.2086 | -10.1049 | 0.8922 | + |
truthfulqa_gen | rouge2_max | 31.0617 | 1.08725 | 28.7438 | 0.981282 | + |
truthfulqa_gen | rouge2_acc | 0.247246 | 0.0151024 | 0.201958 | 0.014054 | + |
truthfulqa_gen | rouge2_diff | -2.84021 | 1.28749 | -11.0916 | 1.01664 | + |
truthfulqa_gen | rougeL_max | 44.6463 | 0.966119 | 42.6116 | 0.893252 | + |
truthfulqa_gen | rougeL_acc | 0.334149 | 0.0165125 | 0.24235 | 0.0150007 | + |
truthfulqa_gen | rougeL_diff | -2.50853 | 1.22016 | -10.4299 | 0.904205 | + |
hendrycksTest-public_relations | acc | 0.3 | 0.0438931 | 0.281818 | 0.0430912 | |
hendrycksTest-public_relations | acc_norm | 0.190909 | 0.0376443 | 0.163636 | 0.0354343 | |
blimp_distractor_agreement_relational_noun | acc | 0.859 | 0.0110109 | 0.833 | 0.0118004 | + |
lambada_mt_fr | ppl | 57.0379 | 3.15719 | 51.7313 | 2.90272 | - |
lambada_mt_fr | acc | 0.388512 | 0.0067906 | 0.40947 | 0.00685084 | - |
blimp_principle_A_case_1 | acc | 1 | 0 | 1 | 0 | |
hendrycksTest-medical_genetics | acc | 0.37 | 0.0485237 | 0.31 | 0.0464823 | + |
hendrycksTest-medical_genetics | acc_norm | 0.41 | 0.0494311 | 0.39 | 0.0490207 | |
qqp | acc | 0.364383 | 0.00239348 | 0.383626 | 0.00241841 | - |
qqp | f1 | 0.516391 | 0.00263674 | 0.451222 | 0.00289696 | + |
iwslt17-en-ar | bleu | 2.35563 | 0.188638 | 4.98225 | 0.275369 | - |
iwslt17-en-ar | chrf | 0.140912 | 0.00503101 | 0.277708 | 0.00415432 | - |
iwslt17-en-ar | ter | 1.0909 | 0.0122111 | 0.954701 | 0.0126737 | - |
multirc | acc | 0.0409234 | 0.00642087 | 0.0178384 | 0.00428994 | + |
hendrycksTest-human_aging | acc | 0.264574 | 0.0296051 | 0.264574 | 0.0296051 | |
hendrycksTest-human_aging | acc_norm | 0.197309 | 0.0267099 | 0.237668 | 0.0285681 | - |
reversed_words | acc | 0.0003 | 0.000173188 | 0 | 0 | + |
(Some results are missing due to errors or computational constraints.)
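The scores above come from EleutherAI's Language Model Evaluation Harness. As a rough sketch of how a single task can be re-run via the harness's Python API, treat the backend name, argument format, and task identifiers below as assumptions, since they differ between harness versions:

from lm_eval import evaluator

# Sketch only: the interface and task names vary across lm-evaluation-harness versions.
results = evaluator.simple_evaluate(
    model="gpt2",  # name of the Hugging Face causal-LM backend in older harness versions
    model_args="pretrained=ykilcher/gpt-4chan",
    tasks=["copa"],
)
print(results["results"])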