Adding swedish_medical_ner #2873
fix the issue in #2846
* Update README.md: changed 'Tain' to 'Train'
* Add pretty_name

Co-authored-by: Quentin Lhoest <[email protected]>
* Update: Openwebtext - update data files checksums
* Update dataset card
Show benchmarks
PyArrow==3.0.0
Show updated benchmarks!
Benchmark: benchmark_array_xd.json
metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new / old (diff) | 0.010835 / 0.011353 (-0.000517) | 0.004242 / 0.011008 (-0.006766) | 0.036916 / 0.038508 (-0.001592) | 0.040823 / 0.023109 (0.017714) | 0.352466 / 0.275898 (0.076568) | 0.394211 / 0.323480 (0.070731) | 0.009327 / 0.007986 (0.001341) | 0.005239 / 0.004328 (0.000911) | 0.010630 / 0.004250 (0.006379) | 0.050514 / 0.037052 (0.013461) | 0.356330 / 0.258489 (0.097841) | 0.405602 / 0.293841 (0.111761) | 0.026724 / 0.128546 (-0.101823) | 0.008958 / 0.075646 (-0.066689) | 0.300794 / 0.419271 (-0.118478) | 0.052408 / 0.043533 (0.008875) | 0.358838 / 0.255139 (0.103699) | 0.392754 / 0.283200 (0.109554) | 0.162942 / 0.141683 (0.021259) | 2.019208 / 1.452155 (0.567053) | 2.123980 / 1.492716 (0.631264) |
Benchmark: benchmark_getitem_100B.json
metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
---|---|---|---|---|
new / old (diff) | 0.218963 / 0.018006 (0.200956) | 0.454086 / 0.000490 (0.453596) | 0.014336 / 0.000200 (0.014136) | 0.000102 / 0.000054 (0.000048) |
Benchmark: benchmark_indices_mapping.json
metric | select | shard | shuffle | sort | train_test_split |
---|---|---|---|---|---|
new / old (diff) | 0.043113 / 0.037411 (0.005701) | 0.027588 / 0.014526 (0.013062) | 0.032057 / 0.176557 (-0.144499) | 0.147344 / 0.737135 (-0.589791) | 0.031864 / 0.296338 (-0.264475) |
Benchmark: benchmark_iterating.json
metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new / old (diff) | 0.430227 / 0.215209 (0.215018) | 4.206032 / 2.077655 (2.128377) | 2.212512 / 1.504120 (0.708393) | 2.074008 / 1.541195 (0.532814) | 2.113160 / 1.468490 (0.644670) | 0.369798 / 4.584777 (-4.214979) | 5.209165 / 3.745712 (1.463453) | 6.082661 / 5.269862 (0.812800) | 3.766156 / 4.565676 (-0.799520) | 0.043588 / 0.424275 (-0.380687) | 0.006442 / 0.007607 (-0.001166) | 0.557979 / 0.226044 (0.331935) | 5.616704 / 2.268929 (3.347776) | 2.747667 / 55.444624 (-52.696957) | 2.261781 / 6.876477 (-4.614696) | 2.310395 / 2.142072 (0.168323) | 0.503469 / 4.805227 (-4.301758) | 0.117947 / 6.500664 (-6.382717) | 0.061702 / 0.075469 (-0.013767) |
Benchmark: benchmark_map_filter.json
metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
---|---|---|---|---|---|---|---|---|---|
new / old (diff) | 120.032847 / 1.841788 (118.191059) | 14.919063 / 8.074308 (6.844755) | 30.924377 / 10.191392 (20.732985) | 0.862032 / 0.680424 (0.181608) | 0.600701 / 0.534201 (0.066500) | 0.260174 / 0.579283 (-0.319110) | 0.583264 / 0.434364 (0.148900) | 0.373701 / 0.540337 (-0.166637) | 1.231935 / 1.386936 (-0.155001) |
Show updated benchmarks!
Benchmark: benchmark_array_xd.json
metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new / old (diff) | 0.010727 / 0.011353 (-0.000626) | 0.004111 / 0.011008 (-0.006897) | 0.036678 / 0.038508 (-0.001830) | 0.041482 / 0.023109 (0.018372) | 0.344385 / 0.275898 (0.068487) | 0.385757 / 0.323480 (0.062277) | 0.009523 / 0.007986 (0.001538) | 0.005459 / 0.004328 (0.001131) | 0.010607 / 0.004250 (0.006356) | 0.048749 / 0.037052 (0.011696) | 0.339073 / 0.258489 (0.080584) | 0.390318 / 0.293841 (0.096477) | 0.027357 / 0.128546 (-0.101190) | 0.008829 / 0.075646 (-0.066818) | 0.303201 / 0.419271 (-0.116071) | 0.054252 / 0.043533 (0.010719) | 0.337872 / 0.255139 (0.082733) | 0.394824 / 0.283200 (0.111625) | 0.155547 / 0.141683 (0.013864) | 2.010645 / 1.452155 (0.558491) | 2.099573 / 1.492716 (0.606857) |
Benchmark: benchmark_getitem_100B.json
metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
---|---|---|---|---|
new / old (diff) | 0.414019 / 0.018006 (0.396013) | 0.478181 / 0.000490 (0.477692) | 0.068522 / 0.000200 (0.068322) | 0.000353 / 0.000054 (0.000299) |
Benchmark: benchmark_indices_mapping.json
metric | select | shard | shuffle | sort | train_test_split |
---|---|---|---|---|---|
new / old (diff) | 0.042385 / 0.037411 (0.004973) | 0.025997 / 0.014526 (0.011471) | 0.029807 / 0.176557 (-0.146750) | 0.148218 / 0.737135 (-0.588917) | 0.032039 / 0.296338 (-0.264300) |
Benchmark: benchmark_iterating.json
metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new / old (diff) | 0.409857 / 0.215209 (0.194647) | 4.013152 / 2.077655 (1.935498) | 2.008397 / 1.504120 (0.504277) | 1.826900 / 1.541195 (0.285705) | 1.911383 / 1.468490 (0.442893) | 0.357248 / 4.584777 (-4.227529) | 5.193846 / 3.745712 (1.448134) | 5.003012 / 5.269862 (-0.266850) | 2.982003 / 4.565676 (-1.583673) | 0.042828 / 0.424275 (-0.381447) | 0.007060 / 0.007607 (-0.000547) | 0.541243 / 0.226044 (0.315199) | 5.321754 / 2.268929 (3.052826) | 2.618005 / 55.444624 (-52.826620) | 2.220121 / 6.876477 (-4.656356) | 2.249974 / 2.142072 (0.107902) | 0.487443 / 4.805227 (-4.317784) | 0.113382 / 6.500664 (-6.387282) | 0.136074 / 0.075469 (0.060605) |
Benchmark: benchmark_map_filter.json
metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
---|---|---|---|---|---|---|---|---|---|
new / old (diff) | 126.009356 / 1.841788 (124.167568) | 15.135564 / 8.074308 (7.061256) | 30.782740 / 10.191392 (20.591348) | 0.848738 / 0.680424 (0.168314) | 0.592777 / 0.534201 (0.058576) | 0.273661 / 0.579283 (-0.305622) | 0.594040 / 0.434364 (0.159676) | 0.441454 / 0.540337 (-0.098883) | 1.224994 / 1.386936 (-0.161942) |
* Test xpathjoin
* Implement xpathjoin
* Use xpathjoin to patch Path joinpath/__truediv__
* Update docstring of extend_module_for_streaming
* Test xpathopen
* Implement xpathopen
* Use xpathopen to patch Path.open
* Clean tests for streaming
* Test _as_posix
* Fix _as_posix for hops starting with slash
* Add docstrings
* Test xjoin with local paths
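The idea behind these `xjoin`/`xpath*` streaming helpers can be sketched as follows; this is a simplified illustration of the approach (join with `/` for URL-like hops, fall back to normal path joining locally), not the library's actual implementation:

```python
import os

def xjoin(base: str, *parts: str) -> str:
    """Join path segments, using "/" for URLs and os.path.join locally.

    Simplified sketch: the real helpers also handle chained hops,
    Windows quirks, and patch pathlib.Path methods for streaming.
    """
    if "://" in base:
        # URL-like base: always join with forward slashes.
        return "/".join([base.rstrip("/"), *parts])
    # Local path: keep platform-native joining.
    return os.path.join(base, *parts)
```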
* make timit_asr streamable
* update docs about dirname
* fix test
* fix tests
* style
* fix windows test
* again
The failing test is about the dataset card. Some information is missing:
* the `Data Fields` section
* the `pretty_name` YAML tag
```
if error_messages:
>       raise ValueError("\n".join(error_messages))
E       ValueError: The following issues have been found in the dataset cards:
E       README Validation:
E       The following issues were found for the README at `/home/circleci/datasets/datasets/swedish_medical_ner/README.md`:
E       - Expected some content in section `Data Fields` but it is empty.
E       The following issues have been found in the dataset cards:
E       YAML tags:
E       __init__() missing 1 required positional argument: 'pretty_name'
```
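A sketch of what fixing the card involves (the tag value below is illustrative, not taken from the actual dataset card):

```yaml
# In README.md's YAML front matter, add the missing tag:
pretty_name: Swedish Medical NER
```

and fill the empty `Data Fields` section with a short description of each column in the dataset.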
Hi, what's the current status of this request? It says "Changes requested", but I can't see what changes.
Hi, it looks like this PR includes changes to other files that are not related to this dataset. Feel free to remove these changes, or simply create a new PR that only contains the addition of the dataset.
Thanks a lot for adding this dataset! The dataset card and dataset script look all good to me, good job :)
I just added one comment.
Also feel free to ping me when you have removed the other changes or created a new PR.
> ## Dataset Structure
>
> ### Data Instances
In this section we expect to see an actual example from the dataset, as it appears when people use the dataset. Feel free to get one example using `load_dataset` and `dataset["train"][0]`, and put it here :)
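Concretely, fetching such an example might look like the sketch below. The `load_dataset` call is shown as a comment since it requires the library and network access, and the stand-in record uses illustrative field names, not the dataset's actual schema:

```python
import json

# With `datasets` installed, one would fetch a real example roughly like:
#   from datasets import load_dataset
#   example = load_dataset("swedish_medical_ner", "wiki")["train"][0]
# Stand-in record with illustrative (hypothetical) field names:
example = {
    "sid": "wiki-0",
    "sentence": "Patienten behandlades med penicillin.",
    "entities": [{"text": "penicillin", "start": 26, "end": 36, "type": 2}],
}

# Pretty-print it for pasting into the card's "Data Instances" section:
print(json.dumps(example, ensure_ascii=False, indent=2))
```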
Adding the Swedish Medical NER dataset, listed in "Biomedical Datasets - BigScience Workshop 2021"
Code refactored