The Wayback Machine - https://web.archive.org/web/20211003225213/https://github.com/SheetJS/sheetjs/issues/2239
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Csv with semicolon-comma wrong parsing #2239

Open
defusioner opened this issue Feb 25, 2021 · 5 comments
Open

Csv with semicolon-comma wrong parsing #2239

defusioner opened this issue Feb 25, 2021 · 5 comments

Comments

@defusioner
Copy link

@defusioner defusioner commented Feb 25, 2021

Hi, I'm trying to read a CSV file like this (;, delimiters):

A;B;C;D
1,1;2,2;3,3;4,1
1,1;2,2;3,3;4,1
1,1;2,2;3,3;4,1

What happens: semicolon from the first line does not become the main separator, the comma from the second line does and it breaks the file's logic.

Code that reads:

X.read(e.target.result, { type: 'string' })

Result

{
      A1: { t: 's', v: 'A;B;C;D' },
      A2: { t: 'n', w: '1', v: 1 },
      B2: { t: 's', v: '1;2' },
      C2: { t: 's', v: '2;3' },
      D2: { t: 's', v: '3;4' },
      E2: { t: 'n', w: '1', v: 1 },
      A3: { t: 'n', w: '1', v: 1 },
      B3: { t: 's', v: '1;2' },
      C3: { t: 's', v: '2;3' },
      D3: { t: 's', v: '3;4' },
      E3: { t: 'n', w: '1', v: 1 },
      A4: { t: 'n', w: '1', v: 1 },
      B4: { t: 's', v: '1;2' },
      C4: { t: 's', v: '2;3' },
      D4: { t: 's', v: '3;4' },
      E4: { t: 'n', w: '1', v: 1 },
      '!ref': 'A1:E4'
    }

I have seen your comment about standard delimiters, but normally semicolon is a basic one (#1087 (comment))

Probably I am doing something wrong? Thanks

@SheetJSDev
Copy link
Contributor

@SheetJSDev SheetJSDev commented Sep 10, 2021

The overarching issue is locales (in some locales, commas are treated as decimal points and semicolons are the field separators). Excel's general backdoor for the delimiter is the sep= pragma.

We'd accept a PR that adds an override. The change would be in bits/40_harb.js:

		if(str.slice(0,4) == "sep=") {
			// If the line ends in \r\n
			if(str.charCodeAt(5) == 13 && str.charCodeAt(6) == 10 ) {
				sep = str.charAt(4); str = str.slice(7);
			}
			// If line ends in \r OR \n
			else if(str.charCodeAt(5) == 13 || str.charCodeAt(5) == 10 ) {
				sep = str.charAt(4); str = str.slice(6);
			}
			else sep = guess_sep(str.slice(0,1024));
		}
+		else if(opts && opts.FS) sep = opts.FS;
		else sep = guess_sep(str.slice(0,1024));

and the end code would be

X.read(e.target.result, { type: 'string', FS: ";" })

@Swastik2000
Copy link

@Swastik2000 Swastik2000 commented Sep 12, 2021

@SheetJSDev Can I take this issue?

@defusioner
Copy link
Author

@defusioner defusioner commented Sep 12, 2021

The overarching issue is locales (in some locales, commas are treated as decimal points and semicolons are the field separators). Excel's general backdoor for the delimiter is the sep= pragma.

We'd accept a PR that adds an override. The change would be in bits/40_harb.js:

		if(str.slice(0,4) == "sep=") {
			// If the line ends in \r\n
			if(str.charCodeAt(5) == 13 && str.charCodeAt(6) == 10 ) {
				sep = str.charAt(4); str = str.slice(7);
			}
			// If line ends in \r OR \n
			else if(str.charCodeAt(5) == 13 || str.charCodeAt(5) == 10 ) {
				sep = str.charAt(4); str = str.slice(6);
			}
			else sep = guess_sep(str.slice(0,1024));
		}
+		else if(opts && opts.FS) sep = opts.FS;
		else sep = guess_sep(str.slice(0,1024));

and the end code would be

X.read(e.target.result, { type: 'string', FS: ";" })

We have done exactly this workaround on the code side (reading the first line and finding out the separator). I thought it could be done automatically.

@SheetJSDev
Copy link
Contributor

@SheetJSDev SheetJSDev commented Sep 12, 2021

The current behavior is consistent with Excel.

Consider the original sample file:

A;B;C;D
1,1;2,2;3,3;4,1
1,1;2,2;3,3;4,1
1,1;2,2;3,3;4,1

In the English (United States) regional setting, Excel will parse as if the field separator (Windows calls this "List separator") is the comma character:

If you are using a regional setting with ";" as the list separator and "," as the decimal symbol, then you will get the parse you expect.

In windows 7 these settings can be found in "Control Panel" > "Clock, Language, and Region" > "Region and Language" > "Additional Settings". en_US has the following settings:

Note: Excel's backdoor for controlling the list separator is the aforementioned sep= pragma. If you do specify it, Excel will use the separator but it won't change the decimal separator:

@malek227
Copy link

@malek227 malek227 commented Sep 30, 2021

#malek``

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants