adding swedish_medical_ner #2873

Closed · wants to merge 15 commits

Conversation

@bwang482 (Contributor) commented Sep 7, 2021

Adding the Swedish Medical NER dataset, listed in "Biomedical Datasets - BigScience Workshop 2021"

Code refactored
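A minimal sketch of how the new loading script can be smoke-tested from a local checkout of the repo while the PR is under review (the relative script path and the config name "wiki" are assumptions for illustration; the actual configuration names are defined in the loading script):

```python
# Minimal smoke test for the new loading script from a local checkout of the
# datasets repo; the script path and the "wiki" config name are hypothetical.
from datasets import load_dataset

dataset = load_dataset("./datasets/swedish_medical_ner", "wiki")
print(dataset)  # shows the generated splits and features
```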

bwang482 and others added 5 commits Sep 6, 2021
* Update README.md

Changed 'Tain' to 'Train'.

* add pretty_name

Co-authored-by: Quentin Lhoest <[email protected]>
* Update: Openwebtext - update data files checksums

* update dataset card
github-actions bot commented on 9ca2425 Sep 7, 2021

[Automated benchmark report comparing new vs. old timings for benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, and benchmark_map_filter, run with PyArrow==3.0.0 and PyArrow==latest.]

* Test xpathjoin

* Implement xpathjoin

* Use xpathjoin to patch Path joinpath/__truediv__

* Update docstring of extend_module_for_streaming

* Test xpathopen

* Implement xpathopen

* Use xpathopen to patch Path.open

* Clean tests for streaming

* Test _as_posix

* Fix _as_posix for hops starting with slash

* Add docstrings

* Test xjoin with local paths
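The commits above concern path utilities (xjoin/xpathjoin/xpathopen) that make Path-style joins and opens work on URLs for streaming. Purely as an illustration of the idea, and not the actual implementation from these commits, an xjoin-style helper has to keep forward slashes for remote URLs while delegating to the OS separator for local paths:

```python
# Illustrative sketch only (not the implementation added in these commits):
# join path segments so that both local paths and remote URLs stay valid.
import os
import posixpath
from urllib.parse import urlparse

def xjoin_sketch(base: str, *parts: str) -> str:
    # URLs must keep forward slashes; local paths use the OS separator.
    if urlparse(base).scheme in ("http", "https", "ftp"):
        return posixpath.join(base, *parts)
    return os.path.join(base, *parts)
```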
github-actions bot commented on 486e7ba Sep 7, 2021

[Automated benchmark report comparing new vs. old timings for benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, and benchmark_map_filter, run with PyArrow==3.0.0 and PyArrow==latest.]

* make timit_asr streamable

* update docs about dirname

* fix test

* fix tests

* style

* fix windows test

* again
github-actions bot commented on 9a2dff6 Sep 7, 2021

[Automated benchmark report comparing new vs. old timings for benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, and benchmark_map_filter, run with PyArrow==3.0.0 and PyArrow==latest.]

@albertvillanova (Member) left a comment

The failing test is about the dataset card. Some information is missing:

  • Data Fields section
  • pretty_name tag
        if error_messages:
>           raise ValueError("\n".join(error_messages))
E           ValueError: The following issues have been found in the dataset cards:
E           README Validation:
E           The following issues were found for the README at `/home/circleci/datasets/datasets/swedish_medical_ner/README.md`:
E           -	Expected some content in section `Data Fields` but it is empty.
E           The following issues have been found in the dataset cards:
E           YAML tags:
E           __init__() missing 1 required positional argument: 'pretty_name'

@lhoestq lhoestq mentioned this pull request Sep 10, 2021
@bwang482 (Contributor Author) commented Sep 10, 2021

Hi, what's the current status of this pull request? It says "Changes requested", but I can't see which changes are requested.

@lhoestq (Member) commented Sep 16, 2021

Hi, it looks like this PR includes changes to files other than swedish_medical_ner.

Feel free to remove these changes, or simply create a new PR that only contains the addition of the dataset.

@lhoestq (Member) left a comment

Thanks a lot for adding this dataset! The dataset card and dataset script look all good to me, good job :)

I just added one comment.

Also feel free to ping me when you have removed the other changes or created a new PR.


## Dataset Structure

### Data Instances
@lhoestq (Member) Sep 16, 2021

In this section we expect to see an actual example from the dataset, exactly as it appears when people load and use the dataset.

Feel free to get one example using load_dataset and dataset["train"][0] and put it here :)
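For instance, a minimal sketch of how such an example could be pulled for the Data Instances section (the config name "wiki" is an assumption; use whichever configuration the loading script actually defines):

```python
# Sketch for grabbing one example to paste into the "Data Instances" section
# of the dataset card; the "wiki" config name is hypothetical.
from datasets import load_dataset

dataset = load_dataset("swedish_medical_ner", "wiki")
print(dataset["train"][0])  # copy the printed dict into the README
```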

@bwang482 (Contributor Author) Sep 17, 2021

Thanks @lhoestq! I have created a new PR: #2940.

@bwang482 bwang482 closed this Sep 17, 2021