Melodeon.net Forums


Author Topic: Rogue Decimals in ABC files  (Read 1039 times)


Gena Crisman

  • Hero Member
  • *****
  • Offline Offline
  • Posts: 1041
  • 🇬🇧
Rogue Decimals in ABC files
« on: December 22, 2019, 11:06:13 PM »

The 'too long, didn't read' version of this post is: An old piece of software, called swtoabc, may be responsible for adding long, non-working decimal numbers such as 3.99999962500005 into some ABC files when it attempted to save triplet-style notes. If you see that kind of thing, and you simplify the ratio, and the bottom of the fraction is a multiple of 3, most likely what you're looking at should be a triplet.

Anyway: In a recent thread, Roger Hare stated he had a few files which were broken due to long floating point numbers. When learning of this, I said:
...For some reason the numbers have been converted to floating point style numbers by a computer...
If you ever come across a tune or ideally a large file with a lot of this kind of damage, shoot me a
message, as it would tickle me pink to fix it.

And just this morning I was contacted by Roger with some files he'd found!

I've also identified the source of this file - it's the ABC file download at: http://sniff.numachi.com/.
This is a tar.gz file which unpacks into a directory with ~4000 individual tune files. Many of them are 'OK',
but there are ~90 with this sort of artefact included.

They seem to have been converted to ABC using a swtoabc program (sw=SongWriter?). I wonder if this
introduced the problem, or if it is an unwanted side-effect of tar-ing and gz-ing the files and then reversing
the process?

It would tickle me pink too, to see how you go about fixing it - I haven't the nerve, or the musical savvy
to be able to do it with any degree of confidence.

So, thank you again Roger for this early Christmas present! Having acquired the archive, there are apparently a couple of thousand instances of these types of numbers, which I will usually be calling 'floats', meaning floating point numbers. It turns out that they are all divisions of one number by another. Using a regex (a kind of fancy search that can find results that fit a pattern, instead of a specific word or phrase) to find all of them and then deduplicate them (which is to say, keep only the unique results), those thousands of results boil down to 6 different ratios, spread across 247 suspect tune files. It is important to note that, as Roger said, every file in this archive was created by a piece of software called swtoabc, a program that can apparently convert the SongWright .tun files that the scores were originally written in into ABC files.

I first used this regex: [0-9]\.[0-9] to find all floating point numbers in the files, then scanned the results with this regex: [0-9]+\.[0-9]+/[0-9\.]+ to pull out all the examples of division, which were:
3.99999962500005/11.9999985000002
3.99999962500005/5.99999925000009
5.3333335/4
15.9999925000037/23.999988000006
3.99999962500005/23.9999970000004
21.333334/16
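
For anyone wanting to repeat that scan, here is a minimal sketch in Python, assuming the ABC text has already been read into a string. The function name `unique_ratios` and the sample string are made up for illustration; only the regex comes from the post.

```python
import re

# The second regex from the post: a decimal number, a slash, then digits/dots.
RATIO = re.compile(r"[0-9]+\.[0-9]+/[0-9\.]+")

def unique_ratios(abc_text):
    """Return the distinct float ratios found in a string of ABC, sorted."""
    return sorted(set(RATIO.findall(abc_text)))

# Hypothetical sample assembled from fragments quoted in this thread.
sample = "(3e5.3333335/4e5.3333335/4f5.3333335/4 d2 C3.99999962500005/5.99999925000009|"
print(unique_ratios(sample))
```

Running the same function over every file in a directory and merging the results should give the short list of distinct ratios, although that loop is left out here.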


No further floating point numbers of any kind were found besides those listed above. Now, some of these look ridiculous; 5.3333/4?? Who writes a number like that? That's just... 4/3! Then a realisation hit me: these are swtoabc's attempts at writing triplets! So, I could fix these with a simple find/replace, swapping each of the above for a ratio that makes sense, such as:
3.99999962500005/11.9999985000002 = 4/12 = 1/3
3.99999962500005/5.99999925000009 = 4/6 = 2/3
15.9999925000037/23.999988000006 = 16/24 = 2/3
3.99999962500005/23.9999970000004 = 4/24 = 1/6
5.3333335/4 = 4/3
21.333334/16 = 4/3
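
The simplification above can also be done mechanically. This is a small sketch using Python's `fractions` module; the helper name `simplify` and the denominator cap of 64 are assumptions chosen for illustration, not anything swtoabc itself does.

```python
from fractions import Fraction

def simplify(ratio_text, max_denominator=64):
    """Turn text like '5.3333335/4' into the nearest simple fraction."""
    numerator, denominator = ratio_text.split("/")
    value = float(numerator) / float(denominator)
    # limit_denominator picks the closest fraction with a small denominator,
    # which absorbs the tiny rounding noise in the stored decimals.
    return Fraction(value).limit_denominator(max_denominator)

for text in [
    "3.99999962500005/11.9999985000002",
    "3.99999962500005/5.99999925000009",
    "15.9999925000037/23.999988000006",
    "3.99999962500005/23.9999970000004",
    "5.3333335/4",
    "21.333334/16",
]:
    print(text, "=", simplify(text))
```

Dividing first and then simplifying matters for 5.3333335/4, where rounding the top and bottom separately would give 5/4 rather than 4/3.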


While practically these are probably 'correct', there is most likely a better approach in terms of readability of the score. On top of this, I did notice that unfortunately some of these ratio sets were showing up like this (spaces added for clarity):
(3 e5.3333335/4 e5.3333335/4 f5.3333335/4 e5.3333335/4 f5.3333335/4 e5.3333335/4
(2 C3.99999962500005/5.99999925000009 C3.99999962500005/5.99999925000009 C3.99999962500005/5.99999925000009 D3.99999962500005/5.99999925000009|


These examples include a triplet & duplet ABC instruction prior to the notes which, given it is already trying to write triplets as ratio note lengths, may actually be completely erroneous? For example, in the former case, there are clearly 6 notes that have adjusted lengths, and in the latter case, there are 4. If we look at the whole bar, something is clearly amiss:
ERINSLEE.abc
T:Down Erin's Lovely Lee
M:12/8
L:1/16
% (so we're expecting it to add up to 24)
|(3e5.3333335/4e5.3333335/4f5.3333335/4e5.3333335/4f5.3333335/4e5.3333335/4 d2 B2-c-B-A B c6- c4 C2|


If we add the latter notes, we see 2 + 2 + 1 + 1 + 1 + 1 + 6 + 4 + 2, which = 20, and we want to put 6 notes in that 4/16 space? With 3 of them as weird double-triplets?? That doesn't add up, the latter 3 4/3s notes alone would add up to our remaining 4, filling the bar.

And for the 2nd example:
DYINGNUN.abc
T:The Dying Nun
M:3/4
L:1/8
(2C3.99999962500005/5.99999925000009C3.99999962500005/5.99999925000009C3.99999962500005/5.99999925000009 D3.99999962500005/5.99999925000009|\
 E3 GA|


We know this is 4 * 2/3 note lengths which, if we ignore the (2, adds up to 2 & 2/3, and, if we include the 2, well, genuinely I'd have no idea what that adds up to, the notation reference suggests maybe 3 & 1/3 total? Neither of those is good! Plus... the bar after only adds up to 5??!
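
The bar arithmetic in both examples can be checked with a few lines of Python. This is only a rough sketch: the `NOTE` pattern and `bar_length` helper are hypothetical, they count plain notes and rests in units of L:, and they ignore tuplet marks, chords, bare `/` shorthand and everything else in the ABC standard.

```python
import re
from fractions import Fraction

# A plain ABC note or rest: optional accidentals, a pitch letter (or z),
# optional octave marks, then an optional length such as 2, /2 or 2/3.
NOTE = re.compile(r"[\^_=]*[A-Ga-gz][,']*([0-9]*)(?:/([0-9]+))?")

def bar_length(bar):
    """Sum the note lengths in one bar, in units of the L: field."""
    total = Fraction(0)
    for match in NOTE.finditer(bar):
        numerator = int(match.group(1) or 1)
        denominator = int(match.group(2) or 1)
        total += Fraction(numerator, denominator)
    return total

print(bar_length("d2 B2-c-B-A B c6- c4 C2"))  # 20, leaving 4 units for the triplet
print(bar_length("E3 GA"))                    # 5, one short of a 3/4 bar at L:1/8
print(bar_length("C2/3C2/3C2/3 D2/3"))        # 8/3, i.e. 2 & 2/3
```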

Fortunately, we have a secret weapon in this mystery: these tunes are made available in other formats from that original website. Here's a URL to these two tunes:
http://sniff.numachi.com/pages/tiERINSLEE;ttERINSLEE.html
http://sniff.numachi.com/pages/tiDYINGNUN;ttDYINGNUN.html

Gazing upon the scores+files here, we learn a few things - mainly we learn that this swtoabc program has some serious flaws when it comes to triplets. In the first case, Down Erin's Lovely Lee, we're looking at bar 4, which I would probably write the ABC as: (3e2f2e2 d2 (B2cBA)B c6-c4 C2. So we can see there's some clear issue in how the software interpreted the tun file and as a result there was some corruption of data here - the note pattern above is some kind of hugely mistimed e e f e f e monster? In Dying Nun, our first 2 bars should be (3C2C2E2 | E3 G AB |. Why was it written as (2, so duplets, and with 4 notes? And, where did that B go???

So, clearly, I do suspect there's going to be a good number of problems with these abc files, some unrelated to these note lengths - The ABC for Dying Nun does not reflect the 5/8 change, for example.

Sticking to my initial goal, though, I assumed that perhaps there were subtle differences in the scores that caused swtoabc to make these varied mistakes. Investigating further, I discovered this convenient tune: http://sniff.numachi.com/pages/ttB10_80.html - one of 4 versions of 'The Twa Sisters'. It would seem that swtoabc converted the triplets in this instance of the file in 3 different ways.

The first two sets of triplets come in like this:
E3.99999962500005/5.99999925000009 D3.99999962500005/5.99999925000009 C3.99999962500005/5.99999925000009
E3.99999962500005/5.99999925000009 D3.99999962500005/5.99999925000009 C3.99999962500005/5.99999925000009

So, it hoped to give us 3 '4/6' notes. This actually makes sense, as 3 2/3 notes = 2, which is the point of a triplet. But, if I actually find/replaced these with 2/3, then on my score you'd be looking at some quavers with 3 dots after them... which I'm not sure is even remotely correct. They should really be marked as triplets.

The next set though, which are at the start of line 2, are a set that are supposed to be slurred together. Instead, these come in like this:
(3C3.99999962500005/5.99999925000009C3.99999962500005/5.99999925000009D3.99999962500005/5.99999925000009C3.99999962500005/5.99999925000009D3.99999962500005/5.99999925000009E3.99999962500005/5.99999925000009|
Here we now have a (3, whereas we didn't above, and we also have a note pattern of C C D C D E, all '4/6' again. So now we have twice as many notes as we're supposed to, and the duration is some hideous combination of the triplet instruction and the ratios. Earlier we had e e f e f e != e f e, now we have C C D C D E != C D E, so I'm not sure if it's the 4th, 5th and 6th notes that are correct, or 1, 3 and 6, or 2, 3 and 6, or what. More samples would be required.

The 4th triplet set in this tune is untied, like the 1st and 2nd sets, and comes in with floating point lengths but otherwise unmolested.

The last line, though, has triplets with a single slur. These come in like this:

(2E3.99999962500005/5.99999925000009E3.99999962500005/5.99999925000009D3.99999962500005/5.99999925000009 C3.99999962500005/5.99999925000009 (2C3.99999962500005/5.99999925000009C3.99999962500005/5.99999925000009D3.99999962500005/5.99999925000009 E3.99999962500005/5.99999925000009

Respectively these should be something like (3(ED)C and (3(CD)E, and are basically written more as (2EED C and (2CCD E in the ABC file. So, the first note gets doubled and put in a duplet because ???, the note lengths are turned into almost-garbage, and then a space is added before the final note. Cool. It's kind of fascinating, as you can see that having any ties or extra stuff going on is most likely the reason the converter was getting confused - potentially the program is trying to resolve two things at the same time, and ends up writing duplicate notes and loose ends it hasn't worked out.

So, what have we learnt so far?
Of the 4000 ABC files that were generated from tun files by swtoabc for this archive, we have 247 files with floating points erroneously written in. They are all related to instances of triplets in the tunes. They all match this regex: [0-9]+\.[0-9]+/[0-9\.]+. Of these, by using the following regex: \([0-9][\^_=]*[a-zA-Z][,']*[0-9]+\.[0-9]+ we are able to identify all triplets, duplets etc that are immediately followed by a note with an erroneous floating point length, at least in the test batch. This does however include the ABC note itself and any modifiers, so if we considered other tune databases, any modifiers I've not included in my regex ( [\^_=] and [,'] are the currently included ones) would cause an instance to not be detected. So, it is likely worthwhile to simply assume that there will be no white space between the creation of the triplet with (3 and the floating point number, as we know these were generated by a computer and should all follow the same template. Doing a search of the file base for triplets etc, and then testing for those that don't include a floating point, yielded 0 results, so they do all seem to follow our expectations.
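
As an illustration of that detection step, here is a hedged Python sketch. The note class is written here as `[a-zA-Z]`, on the assumption that any letter should count as a note name, and the function name `suspect_tuplets` is invented for this example.

```python
import re

# A tuplet mark such as (2 or (3, then accidentals, a note letter, octave
# marks, and the start of a floating point length.
TUPLET_THEN_FLOAT = re.compile(r"\([0-9][\^_=]*[a-zA-Z][,']*[0-9]+\.[0-9]+")

def suspect_tuplets(abc_text):
    """List every tuplet mark that is directly followed by a float-length note."""
    return [match.group(0) for match in TUPLET_THEN_FLOAT.finditer(abc_text)]

line = "(3e5.3333335/4e5.3333335/4f5.3333335/4 d2 (3abc"
print(suspect_tuplets(line))  # the healthy (3abc at the end is not reported
```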

Any instance that includes a (3 or a (2 most likely has note corruption in the form of several duplicate notes. Working out how these notes are duplicated may be impossible without manual verification vs the source material, but, it may also be possible to repair it if the placement of the extra mark and notes depends on the location of the ties that caused the issue in the first place.

We know we can repair Case (2:
It would appear that (2 appears when either the 1st and 2nd, or 2nd and 3rd note are tied or slurred together. The first note to be tied/slurred is seemingly duplicated, and preceded by (2. So, where (2 is found, it should be removed and the note following it should be removed, too. (An example of a 2+3 corruption can be found in their tune Six Dukes.) After this duplet corruption is fixed, we can fix the ratios by replacing them with the appropriate triplet code.
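
A possible way to automate the (2 clean-up is sketched below in Python. The names `FP`, `NOTE`, `DUPLET` and `strip_duplet` are hypothetical, and the pattern makes one extra assumption for safety: it only removes the (2 and the note after it when the next note is an exact duplicate (same pitch, same float ratio).

```python
import re

FP = r"[0-9]+\.[0-9]+/[0-9.]+"     # a float ratio such as 3.999.../5.999...
NOTE = r"[\^_=]*[A-Ga-gz][,']*"    # accidentals, pitch letter, octave marks

# "(2" plus one float-length note, but only if an identical note follows.
DUPLET = re.compile(rf"\(2({NOTE}{FP})(?=\1)")

def strip_duplet(abc_text):
    """Remove each rogue (2 together with the duplicated note after it."""
    return DUPLET.sub("", abc_text)

r = "3.99999962500005/5.99999925000009"
print(strip_duplet(f"(2E{r}E{r}D{r} C{r}"))  # leaves E, D, C with their ratios
```

The lookahead leaves the second copy of the note in place, so only the (2 and the first copy disappear. Any (2 that is not followed by a duplicated float-length note is left untouched for manual review.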

I'm less sure about case (3:
This seems to occur in different circumstances. Shorty George is a divine example of insanity http://sniff.numachi.com/pages/tiSHORTGEO;ttSHORTGEO.html :
z1/3 z1/3(3G,1/3G,1/3A,1/3G,1/3A,1/3A,1/3 B,1/3
This pickup bar should be, all triplet length, (3z z G (3A A B, but what we have here is nonsense lengths and z z (3G G A G A A B. The (3 appears as soon as the first tie/slur appears, in this case between the 3rd triplet of the first 3 (the G,) slurred to the 2nd triplet of the 2nd 3 (the 2nd A,). So, in this case, the symbol is not at the start of a 'set' of triplets, but instead woven inside a set. Consistently, I am observing that after the (3, 6 notes appear, and the last 3 are always the correct 3 notes. However, what I am seeing here is that there is no clarity on where any slur or tie is supposed to end - that information can't be recovered from the ABC as is.
Additionally for Shorty George, the last line sports a bar with a regular length note, a d, tied into a set of triplets. This demonstrates that only the tie/slur of a triplet note to another causes a (2 to occur, rather than any regular length note.

Also, all ties and slurs are messed up throughout these ABC documents: - is used only as a tie now, not for slurs, and many of these are intended to be slurs. Perhaps the ABC spec changed at some point so, idk if these were 'wrong at the time'. However, I'm not as worried about that kind of issue, I'm just saying, I am aware of it. I guess my next step is to automate fixing of the (2 and (3 cases, by eradicating them, and then replace unreadable ratios with readable triplets...

If anyone else finds some examples of wacky floating points in their ABC file and they think it's not linked to swtoabc, feel free also to post about them I guess here and maybe I can help triage & repair the files. (I should also say, I'm mostly doing this on a lark)

Roger Hare

  • Hero Member
  • *****
  • Offline Offline
  • Posts: 828
  • Urmston, Lancashire, U.K.
Re: Rogue Decimals in ABC files
« Reply #1 on: December 23, 2019, 05:53:15 AM »

Thank you Gena! Amazing piece of work - so amazing that the yolk from my fried egg roll is dribbling
all over my keyboard as I sit here open-mouthed... 🎅

See, I was right! I do not have the musical savvy to be able to fix this with any degree of confidence... 🎅

A few brief points:

I've been told that the slur/tie thing was because they look the same on a printed score, so it 'didn't
matter' (but it does when you play back the MIDI output).

I encountered this problem while looking for large 'legacy' files to test a program I'm developing to
'add value' to existing ABC files. I didn't pursue the matter because my main goal was to test my
program, not to fix broken ABC files, so thank you.

Unfortunately, the directory I brought to your attention was not the place where I first encountered
this problem. This was a single file containing many tunes with the same problem. If I track it down,
I will bring it to your attention.

I've encountered a different sort of problem on one other occasion, where <something2abc> had been
used to convert files written using some long-forgotten software to ABC. In that case, it was dead easy
because the problem was that the <something2abc> simply left redundant instructions from <something>
in the generated ABC files. Simple fix - just do a global delete to remove those redundant instructions,
and Bob's yer uncle...

Roger

🦌🦌🦌🦌🦌🦌🛷 🎅
« Last Edit: December 23, 2019, 06:20:09 AM by Roger Hare »
For more about Manchester Morris, The Beech Band Folk Club or anything else,  please use the private messaging facility.
My (large) ABC Tune Book is here.

Gena Crisman

  • Hero Member
  • *****
  • Offline Offline
  • Posts: 1041
  • 🇬🇧
Re: Rogue Decimals in ABC files
« Reply #2 on: December 23, 2019, 12:29:31 PM »

Did you know that abcnotation.com has lots of tunes on it, and you can search that whole tune database? I knew that but it only just occurred to me: what happens if you search for 3.99999962500005, or perhaps, just 9999 ?

The answer is you get a fair number of tunes, and a lot of them quote swtoabc!

Interestingly/unfortunately though I didn't find any with the originally reported error from this post. Sorry Roger!

Also:

Further research shows that all instances of erroneous 6 note groups prefixed by (3 follow the note pattern of: (3 x x y x y z, so, similar to instances of (2 being easy 'remove this data' indicators, the 3 following notes after a (3, if they contain floating point numbers, ought to be removed, as these will be two duplicates of the 1st note, and 1 duplicate of the 2nd note - the correct pattern being only the last 3 notes.
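
The (3 x x y x y z observation lends itself to a regex with backreferences. The following is a sketch under that assumption only: the names `FP`, `NOTE`, `SEXTUPLE` and `strip_triplet_duplicates` are invented here, and the pattern deliberately refuses to touch anything that does not fit the six-note shape exactly.

```python
import re

FP = r"[0-9]+\.[0-9]+/[0-9.]+"     # a float ratio such as 3.999.../5.999...
NOTE = r"[\^_=]*[A-Ga-gz][,']*"    # accidentals, pitch letter, octave marks

# "(3 x x y" is matched and removed only when "x y z" follows immediately,
# where each of x, y, z is a note with a float ratio attached.
SEXTUPLE = re.compile(rf"\(3({NOTE}{FP})\1({NOTE}{FP})(?=\1\2{NOTE}{FP})")

def strip_triplet_duplicates(abc_text):
    """Drop the rogue (3 and the three duplicate notes, keeping the last three."""
    return SEXTUPLE.sub("", abc_text)

r = "3.99999962500005/5.99999925000009"
print(strip_triplet_duplicates(f"(3C{r}C{r}D{r}C{r}D{r}E{r}|"))
```

On the Twa Sisters line quoted earlier this leaves C, D and E with their ratios still attached, ready for the ratios themselves to be replaced.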

This provides a pathway to repair:

If you have an ABC file with errant floating point numbers and strange triplet/duplet marks:

Step 1: Find any+all (2 x FP/FP x FP/FP patterns and remove (2 x FP/FP
Step 2: Find any+all (3 x FP/FP x FP/FP y FP/FP x FP/FP y FP/FP z FP/FP patterns and remove (3 x FP/FP x FP/FP y FP/FP
Step 3: Find all x FP/FP y FP/FP z FP/FP patterns and replace with (3 xn yn zn, where n will be selected based on value of (FP/FP) / (2/3).

What we lose is the knowledge of which notes are supposed to be slurred, or tied, which personally I feel can be open to interpretation anyway. We could optionally add a "_comment" when we perform step 1 & 2 to indicate that this took place, but, this may interfere with step 3 if the slur/tie caused note duplication between groups of triplets.

To explain step 3 a little, we can calculate the value of n, the correct note length, without knowing anything about the rest of the tune, as we know a note sequence of x2/3 y2/3 z2/3 is equal to (3 x y z, remembering that (3 means triplets, which is 3 notes in the space of 2. So, because in this case all of these ratios are the result of attempts at triplets, if we look at the nonsense ratio in the file and divide it by 2/3, we should always get some power of 2. For example, if we see a ratio that is (basically) 1/3, we know we want 3 notes in the space of only 1, so they'd have to be half as long as normal, so we know to replace it with (3 x/y/z/. Likewise, if we see 4/3, we know we want 3 notes to take up the space of 4 notes, so we must have (3x2y2z2 etc. Fortunately, since our floating point values like 3.99999962500005/5.99999925000009 come from a program trying to say '4/6', rather than from any calculation that compounded precision issues, we can use text analysis and scan for 3.99999962500005/5.99999925000009, as it will always be this exact sequence of decimal values, and we can just replace it with the text we know to be correct. In this case (4/6) / (2/3) = 1, so the correct thing to do is just remove this ratio entirely and put the triplet indicator in front of the 3 notes.
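
Step 3 could be sketched in Python along these lines. Everything named here (`length_suffix`, `to_triplets`, the `TRIPLE` pattern) is hypothetical, and the sketch assumes the (2 and (3 clean-up has already been done, so that what remains is runs of three notes sharing one float ratio. It writes a half-length note as `/2`, which ABC treats the same as a bare `/`.

```python
import re
from fractions import Fraction

FP = r"[0-9]+\.[0-9]+/[0-9.]+"     # a float ratio such as 3.999.../5.999...
NOTE = r"[\^_=]*[A-Ga-gz][,']*"    # accidentals, pitch letter, octave marks

# Three consecutive notes (optionally space-separated) with the same ratio.
TRIPLE = re.compile(rf"({NOTE})({FP}) ?({NOTE})\2 ?({NOTE})\2(?![0-9.])")

def length_suffix(ratio_text):
    """Work out n = ratio / (2/3) and write it as an ABC length suffix."""
    top, bottom = ratio_text.split("/")
    ratio = Fraction(float(top) / float(bottom)).limit_denominator(64)
    n = ratio / Fraction(2, 3)
    if n == 1:
        return ""
    if n.denominator == 1:
        return str(n.numerator)
    if n.numerator == 1:
        return f"/{n.denominator}"
    return f"{n.numerator}/{n.denominator}"

def to_triplets(abc_text):
    """Rewrite x FP/FP y FP/FP z FP/FP as (3 xn yn zn."""
    def rebuild(match):
        x, ratio, y, z = match.groups()
        n = length_suffix(ratio)
        return f"(3{x}{n}{y}{n}{z}{n}"
    return TRIPLE.sub(rebuild, abc_text)

print(to_triplets("e5.3333335/4f5.3333335/4e5.3333335/4 d2"))  # (3e2f2e2 d2
```

If n ever came out as something other than a power of 2, the sketch would still emit it, so it would be sensible to review any such output by eye.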

If you have this problem in 1 or 2 tunes and you're aware of it, you should be able to solve this by eye relatively quickly. If you have a lot of files and don't know if one of them has this problem, or, you have a large number that you wish to repair very quickly, the best way to achieve each of these steps is via application of regex, aka regular expressions, as while these look crazy & complicated, they let you specify patterns rather than specific words, phrases, or character sequences.

I will, at some point once I've actually written and tested them, provide a sequence of regex find/replaces to apply to the files I have that can yield informative, as well as reparative, results.

Roger Hare

  • Hero Member
  • *****
  • Offline Offline
  • Posts: 828
  • Urmston, Lancashire, U.K.
Re: Rogue Decimals in ABC files
« Reply #3 on: December 25, 2019, 05:35:45 AM »

...Interestingly/unfortunately though I didn't find any with the originally reported error from this post.
Sorry Roger...
Eees not a problem! In my initial post, I used a 'made-up' number which looked something like what
I had seen - because at the time, I couldn't find a genuine example...🎅

Roger