adding swedish_medical_ner #2873

Closed · wants to merge 15 commits

Conversation

@bwang482 (Contributor) commented Sep 7, 2021

Adding the Swedish Medical NER dataset, listed in "Biomedical Datasets - BigScience Workshop 2021"

Code refactored
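A minimal sketch of how the new loading script can be smoke-tested from a local checkout of the repo while the PR is under review (the relative script path and the config name "wiki" are assumptions for illustration; the actual configuration names are defined in the loading script):

```python
# Minimal smoke test for the new loading script from a local checkout of the
# datasets repo; the script path and the "wiki" config name are hypothetical.
from datasets import load_dataset

dataset = load_dataset("./datasets/swedish_medical_ner", "wiki")
print(dataset)  # shows the generated splits and features
```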

bwang482 and others added 5 commits Sep 6, 2021
* Update README.md

Changed 'Tain' to 'Train'.

* add pretty_name

Co-authored-by: Quentin Lhoest <[email protected]>
* Update: Openwebtext - update data files checksums

* update dataset card
github-actions bot commented on 9ca2425 Sep 7, 2021

[Automated benchmark report comparing new vs. old timings for benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, and benchmark_map_filter, run with PyArrow==3.0.0 and PyArrow==latest.]

* Test xpathjoin

* Implement xpathjoin

* Use xpathjoin to patch Path joinpath/__truediv__

* Update docstring of extend_module_for_streaming

* Test xpathopen

* Implement xpathopen

* Use xpathopen to patch Path.open

* Clean tests for streaming

* Test _as_posix

* Fix _as_posix for hops starting with slash

* Add docstrings

* Test xjoin with local paths
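The commits above concern path utilities (xjoin/xpathjoin/xpathopen) that make Path-style joins and opens work on URLs for streaming. Purely as an illustration of the idea, and not the actual implementation from these commits, an xjoin-style helper has to keep forward slashes for remote URLs while delegating to the OS separator for local paths:

```python
# Illustrative sketch only (not the implementation added in these commits):
# join path segments so that both local paths and remote URLs stay valid.
import os
import posixpath
from urllib.parse import urlparse

def xjoin_sketch(base: str, *parts: str) -> str:
    # URLs must keep forward slashes; local paths use the OS separator.
    if urlparse(base).scheme in ("http", "https", "ftp"):
        return posixpath.join(base, *parts)
    return os.path.join(base, *parts)
```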
github-actions bot commented on 486e7ba Sep 7, 2021

[Automated benchmark report comparing new vs. old timings for benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, and benchmark_map_filter, run with PyArrow==3.0.0 and PyArrow==latest.]

* make timit_asr streamable

* update docs about dirname

* fix test

* fix tests

* style

* fix windows test

* again
github-actions bot commented on 9a2dff6 Sep 7, 2021

[Automated benchmark report comparing new vs. old timings for benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, and benchmark_map_filter, run with PyArrow==3.0.0 and PyArrow==latest.]

@albertvillanova (Member) left a comment

The failing test is about the dataset card. Some information is missing:

  • Data Fields section
  • pretty_name tag
        if error_messages:
>           raise ValueError("\n".join(error_messages))
E           ValueError: The following issues have been found in the dataset cards:
E           README Validation:
E           The following issues were found for the README at `/home/circleci/datasets/datasets/swedish_medical_ner/README.md`:
E           -	Expected some content in section `Data Fields` but it is empty.
E           The following issues have been found in the dataset cards:
E           YAML tags:
E           __init__() missing 1 required positional argument: 'pretty_name'

@lhoestq lhoestq mentioned this pull request Sep 10, 2021
@bwang482 (Contributor Author) commented Sep 10, 2021

Hi, what's the current status of this pull request? It says "Changes requested", but I can't see which changes are requested.

@lhoestq (Member) commented Sep 16, 2021

Hi, it looks like this PR includes changes to files other than swedish_medical_ner.

Feel free to remove these changes, or simply create a new PR that only contains the addition of the dataset.

@lhoestq (Member) left a comment

Thanks a lot for adding this dataset! The dataset card and dataset script look all good to me, good job :)

I just added one comment.

Also feel free to ping me when you have removed the other changes or created a new PR.


## Dataset Structure

### Data Instances
@lhoestq (Member) Sep 16, 2021

In this section we expect to see an actual example from the dataset, exactly as it appears when people load and use the dataset.

Feel free to get one example using load_dataset and dataset["train"][0] and put it here :)
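For instance, a minimal sketch of how such an example could be pulled for the Data Instances section (the config name "wiki" is an assumption; use whichever configuration the loading script actually defines):

```python
# Sketch for grabbing one example to paste into the "Data Instances" section
# of the dataset card; the "wiki" config name is hypothetical.
from datasets import load_dataset

dataset = load_dataset("swedish_medical_ner", "wiki")
print(dataset["train"][0])  # copy the printed dict into the README
```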

@bwang482 (Contributor Author) Sep 17, 2021

Thanks @lhoestq! I have created a new PR: #2940.

@bwang482 bwang482 closed this Sep 17, 2021