Bug Dissection: Comparing Filenames

I think it’s good to confess the errors you made when you created bugs in your software. It makes you a better programmer to write out the process of creating and fixing the bug, and might help someone else at the same time. This is one of those confessions.

Castalia often has a need to determine whether two filenames are the same. For example, the code that maintains internal data about the code you’re working on very frequently checks to see if you’re still working on the same file you were last time. It does this by comparing the name of the file you’re working on with the name of the file you were working on last time it checked. If they’re the same, it’s the same file. If they’re different, then you’re working on a different file.

Easy enough, right? Filenames are just strings, so it’s a simple string comparison.

Of course, there has to be a complication. The complication is that the Delphi ToolsAPI doesn’t always give you the filename in the same format. It’s usually the full path to the file, but occasionally, it will be just the extracted filename.

For example, when Castalia gets the name of the current file from the ToolsAPI, it will probably get a string like ‘c:\users\jacob\documents\project1\MainForm.pas’, but it might just get ‘MainForm.pas.’

So, I wrote an IsSameFile routine that would take two filenames as strings, and compare them in both cases (full path, or just filename), to determine if they are, in fact, the same file:

function IsSameFile(N1, N2: string): Boolean;
begin
  Result := CompareText(N1, N2) = 0; //Usually this works, for full paths
  if not Result then //If it's false, maybe we have a simple filename. Check that
    Result := CompareText(ExtractFileName(N1), ExtractFileName(N2)) = 0;
end;

This worked great, until someone came across an edge case that proved to be a bug in this function.

See if you can spot it. I’ll wait…

OK, did you find it?

Here’s the problem: What if the “short” filenames are the same, but are, in fact, different files? For example:

N1 is ‘c:\users\jacob\documents\project1\MainForm.pas’

N2 is ‘d:\projects\project2\MainForm.pas’

The function above will first try to compare N1 and N2, and then when that’s false, will compare the “short” file names, which happen to match, even though it’s painfully obvious that these are not, in fact, the same file.

There are two fundamental errors in logic here. One is thinking there only two cases: 2 full paths, or 2 “short” names. The second is thinking that the second case is always triggered by the first case being false.

The fixes:

First, realizing that there are actually three distinct cases for comparing the two filenames:

  1. Compare two filenames with full paths
  2. Compare two “short” filenames, with no path
  3. Compare one “short” filename with one full path

Second, these cases need to be detected independently. Just because the first case is false doesn’t mean that wasn’t the right case. There needs to be a mechanism to determine which case to use, and then compare the appropriate strings.

Here’s the fixed IsSameFile function:

function IsSameFile(N1, N2: string): Boolean;
var
  S1, S2: string; //Short filenames
begin
  S1 := ExtractFileName(N1);
  S2 := ExtractFileName(N2);

  if (S1 <> N1) and (S2 <> N2) then
  begin
    //2 full path
    Result := CompareText(N1, N2) = 0;
  end else
  if (S1 = N1) and (S2 = N2) then
  begin
    //2 filename only
    Result := CompareText(N1, N2) = 0;
  end else
  begin
    //1 short, 1 path
    Result := CompareText(S1, S2) = 0;
  end;
end;

This handles all three cases correctly, and more importantly, determines which case to use the right way.

This addressed a bug where Castalia’s navigation toolbar would get confused if you opened two files with the same filename, either at the same time, or in sequence (for example, working on MainForm.pas in one project, then closing that project and immediately opening MainForm.pas from another project). That bug is fixed, thanks to the fixed IsSameFile() function, as of Castalia 2014.5.

14 Responses to Bug Dissection: Comparing Filenames

  1. Jeff May 12, 2014 at 9:09 am #

    That’s not an “edge case”, it’s a compete oversight caused by only considering a trivial application structure.

    How to explain a straightforward error to your customers without accepting blame? Blame the situation that caused the error and call it an “edge case”.

    Also if you have two “short path filenames” then you can’t be sure whether they are the same or different files at all. In fact the whole strategy is doomed to failure.

    Good luck with that.

    • jacob May 12, 2014 at 10:07 am #

      Hi Jeff,

      I agree. It’s a complete oversight caused by not thinking the scenario through properly. This is true of every bug I’ve ever created (or seen anyone else create).

      As for the “two short filenames” scenario, you’re absolutely right (I wondered how long it would take before someone noted that). The bad news is, as you noticed, there really isn’t any way around it. The good news is that in testing, including testing the situation that led to the original bug, that scenario never occurred. Without knowing how the IDE is internally managing those filenames, I can’t know that the situation will never occur, but the worst thing that can happen is that navigation toolbar shows you navigation info for the wrong file. I consider this to be an acceptable risk when compared to the benefits.

      • SilverWarior May 13, 2014 at 4:34 am #

        While I’m not sure about this but based on Project manager observation I asume IDE internaly manages filenames in two ways:
        1. Filename with relative path to the project you are working on
        Such filename could be easily represented with short filename as you know it is located within your project directory
        2. Filename with full file path
        Suhc filenames are probably the ones outside of your projects folder like files beloging to another project.
        So I recomend you make yourself a test case where you will check to see if filenames for files which are inside of your project folders are always the short ones. If so then all you need is add your project path to them in order to do their full name comaprison.

        As far as go with probles of two paths pointing to the same file:
        This is actually a great dificulty introduced with implementation of virtual paths in Windows Vista. If my memory serves me correctly there is an API call which retrieve you true file path.nn1

    • Christoph Engelhardt May 12, 2014 at 10:29 am #

      Hi Jeff,

      Maybe I have read that blog post in a completely different mindest, but Jacob writes about “confessions” in the first paragraph. That – to me – is enough of “accepting blame”.

      I agree that it is a bug. And like all bugs it was introduced by human error and assuming a (more) trivial application structure. But how often do you see a post-mortem with the actual code? How often do you see them outside of TheDailyWTF ? (I am sure that we both agree that this is hardly a TDWTF-type of bug)

      tl;dr: Yes it is a bug, but in my book Jacob’s got my respect for owning up to it in the way he did.

      • Jeff May 13, 2014 at 3:08 am #

        Fair point, but the confession merely skirts around the main problem – the fact that he cannot tell absolutely whether the current file is the same as the last file even if the names are the same.

        So whatever processing he does, based upon that assumption cannot be relied upon to act on the right file. Fortunately the “risk” of it being wrong is not significant. If the possible outcome was that the plug in deleted the latest version of the current file from the last project you worked on, that would be different.

        Maybe I was a bit harsh, but on first (and second readings) it seemed to me that the delphi API function was taking the blame for being unreliable and that Jacon had missed the *real* killer problem ie that it is simply impossible to rely absolutely on his solution ofr determining whether it is actually the same file.

        No amount of “fessing up” will cover that.

        But kudos to Jacob for bringing it to other people’s attention.

  2. SpeedFreak May 12, 2014 at 9:51 am #

    Don’t forget that filenames are unicode.

    You should use AnsiCompareText or your test will fail on comparing something like Æblegrød and Ablegrod (I speak from experience :-)).

  3. Stefan Glienke May 12, 2014 at 10:51 am #

    There are not 3 but 5 cases:
    – two full paths
    – two file names that point to the same file
    – two file names names that point to different files
    – one full path and one file name that point to the same file
    – one full path and one file name that point to different files

    • jacob May 12, 2014 at 11:01 am #

      Hi Stefan,

      If you wrap the results into the cases as well, then there are 6, since you need to consider two full paths to different files, and two full paths to the same file.

      But each of these is really just 3 cases with two different results (3 x 2 cases), which is what the code handles.

      • Stefan Glienke May 12, 2014 at 5:35 pm #

        That is not true. In the case of two full paths you can be sure that your comparison returns the right result.

        In the other cases you are making assumptions that might not be true.

  4. Bogdan May 12, 2014 at 11:21 am #

    There is another case: two different full paths that point to the same file.

    C:\folder\file.txt
    C:\folder\..\folder\file.txt

    The second one is a perfectly valid path that may result from concatenating an absolute path with a relative path.

    • jacob May 12, 2014 at 11:30 am #

      Good point. I thought about this too (and even weirder cases like two completely different paths that point to the same actual file via symlinks).

      In the end, for Castalia’s purposes, this function is a a performance optimization. Castalia would work just fine by rebuilding the data structure for your file every time it displays info about it, but if it’s still looking at the same file it was last time, there’s no need to do that. For that reason, false negatives aren’t really a big problem. If it thinks the files are different when they are, in fact, the same, it just rebuilds the metadata when it didn’t really need to. The big problem is false positives, where Castalia thinks two files are the same when they’re actually different.

      With that consideration, I decided that the “expense” of looking at these other cases, compared to their rarity, didn’t justify creating a special case for them.

  5. Thomas Mueller May 13, 2014 at 5:54 am #

    Just nitpicking: There is a SameText function which you could use instead of CompareText = 0.

  6. Constantin May 13, 2014 at 6:44 am #

    For our VCS we have to add support for mapped drives. For example, the file can be opened as ‘c:\vcs\vcl\14.1\Grid\cxGrid.pas’ and ‘x:\Grid\cxGrid.pas’

  7. Sebastian Jänicke May 13, 2014 at 9:25 am #

    Delphi itself has a similar problem. Sometimes it finds the wrong file by using “find declaration” (same filename, different file path).

For programmers, by a programmer

Hi. My name is Jacob, and I'm the creator of Castalia.

I starting programming in 1986, learning Lightspeed Pascal on a Mac Classic. Today, I'm a professional programmer, teacher, and entrepreneur.

I have a Master's Degree in Computer Science, and I still love Pascal and Delphi.

I believe that writing code is the heart and soul of software development, and I love helping programmers write code more effectively.