Tk Source Code

View Ticket
Login
Bounty program for improvements to Tcl and certain Tcl packages.
Ticket UUID: e3888d582045d0817375b66ae7f5e46f93f809e9
Title: Grab on master prevents mouse pointer warp into slave widget
Type: Bug Version: 8.6.9
Submitter: erikleunissen Created on: 2019-11-09 20:23:45
Subsystem: 69. Events Assigned To: fvogel
Priority: 5 Medium Severity: Minor
Status: Closed Last Modified: 2020-07-06 22:34:08
Resolution: Fixed Closed By: fvogel
    Closed on: 2020-07-06 22:34:08
Description:
If there is a grab on the master, the mouse pointer doesn't warp into the slave widget [*].
I'm unsure whether this effect of a grab is intended.

Under Linux, AND NOT under Windows, this obstructive effect of the grab doesn't show if the mouse pointer was moved into the master and out again before the grab on the master was in effect (a warp "prologue"). So whatever the intended behaviour with the grab is, there seems to be at least one inconsistency somewhere.

The issue can be exercised with the attached script exercise.tcl.
This script accepts two boolean parameters to indicate:
- whether the grab on the master will be set or not.
- whether the warp prologue will be exercised or not
Alternate settings for these arguments show the difference that the grab and the prologue make.

[*] Once in a while there is an occasion, where running the test script under Linux (without the prologue) doesn't exhibit the obstruction of the mouse pointer. I didn't find a pattern in the occurrence of these exceptions.
User Comments: fvogel added on 2020-07-06 22:34:08:
Done as proposed below: core-8-6-branch has the fix with idle warping, and trunk has the fix with synchronous warping.

fvogel added on 2020-07-05 18:15:15:

Thanks for your testing results. So everything is now okay. Good!

About merging, what we could do is merge the bug-e3888d5820 (async) bugfix branch into core-8-6-branch (i.e. in the maintenance branch of 8.6), and merge the bug-e3888d5820-alt1 (synchronous) bugfix branch into trunk (from which 8.7 releases are created). That would minimize the risk of breaking anyone's 8.6-line code that would rely on warping happening at idle time.


erikleunissen added on 2020-06-17 14:36:35:
And the test results for other Linux window managers:

               5c500188    273ef8a6
TWM: 
bug present     FAIL         FAIL
bug fixed       PASS         PASS

ICEWM: 
bug present     FAIL         FAIL
bug fixed       PASS         PASS

All is well.

erikleunissen added on 2020-06-15 18:09:11:
Confirming correct test results of test bind-35.1:
- on KDE/Plasma,
- with and without the Aurorae-theme line in ~/.config/kwinrc,
- with the tip of each bug branch (idle warping: commit 5c500188, synchronous warping: commit 273ef8a6),
- for bug fixed (PASS) and bug present (FAIL), using runtime-loading of libtk8.6.so from Tk release 8.6.10 for "bug present" as explained previously.

fvogel added on 2020-06-14 16:37:38:

Correct, it seems that in the quest for robustness and for understanding what was happening with your system 1 I had lost failure of bind-35.1 in core-8-6-branch for the bug-e3888d5820 branch.

I have reworked the test once again, simplified it a bit and have removed the testgrab proc and constraint in bug-e3888d5820. This is [5c500188].

Now in both bugfix branches (bug-e3888d5820 and bug-e3888d5820-alt1) the bug is fixed (respectively with idle warping and synchronous warping), and test bind-35.1 passes in the bugfix branch while it fails in core-8-6-branch (checked on Windows and Linux Debian 10 with KDE/Plasma with the aurorare line in kwinrc).


erikleunissen added on 2020-06-11 15:15:05:
"tkTest.c" not "tkText.c" (of course)

erikleunissen added on 2020-06-11 15:11:09:
Right!

Well I did:

1. Build c5754760, including tktest;

2. Ensure that the newly built tktest executable loads (at runtime) the libtk8.6.so binary from release Tk8.6.10 instead of the the binary that has just been built from c5754760. (I control this by swapping symlinks to the binary in the LD_LIBRARY_PATH from where tktest picks it up).

3. Run from the build directory:

      tktest TESTFLAGS="-file bind.test -match bind-35.1 -verbose beps"

This way the test runs fine, the command testgrab exists in tkText.c which is built into in the tktest binary, not into libtk8.6.so.

fvogel added on 2020-06-10 20:55:17:
Well, I didn't try because this test should be skipped due to constraint 'testgrab'. How can you make this test fail since it needs the 'testgrab' new test command I have implemented only in branch bug-e3888d5820? Did you also backport the testgrab command perhaps?

Note that the current core-8-6-branch includes another test named "bind-35.1" that is specific to aqua (and therefore skipped on Linux) and has nothing to do with the present ticket (we will need to renumber our bind-35.1 to bind-36.1 before merging to avoid the conflictà

erikleunissen added on 2020-06-10 20:27:49:
Completing previous post: that is, with KDE/Plasma and aurorae config file kwinrc.

erikleunissen added on 2020-06-10 20:25:34:
Test bind-35-1 in branch bug-e3888d5820, latest commit c5754760 passes with Tk release 8.6.10 (bug present).

Do you reproduce?

fvogel added on 2020-06-08 09:00:20:

Do I understand it correctly, that the synchronous bug fix makes the tail, head and mark options obsolete?

No it doesn't. The -when option of event generate applies (in all branches) to the event being generated, not to the warping. This has always been the case and it does not change. For instance:

    event generate $w <Motion> -warp 1 -x 10 -y 10 -when tail
will queue a <Motion> event for window $w in all cases. It will also schedule warping at idle time (in the async branch or core-8-6-branch). In the synchronous (-alt1) branch however this command will not queue the warping but execute it immediately and return only when warping was performed. The queuing of the <Motion> event will still happen.


erikleunissen added on 2020-06-08 08:28:02:
Uhmm, I withdraw the "undocumented" part of my considerations.

man event says:

------
       -when when
              When determines when the event will be processed;  it must have one of the following values:

              now       Process the event immediately, before the command returns.  This also happens if the -when option is omitted.

              tail      Place the event on Tcl's event queue behind any events already queued for this application.

              head      Place the event at the front of Tcl's event queue, so that it will be handled before any other events  already
                        queued.

              mark      Place  the event at the front of Tcl's event queue but behind any other events already queued with -when mark.
                        This option is useful when generating a series of events that should be processed in order but at the front of
                        the queue.

------

Apparently there is a basis for programming these events. Do I understand it correctly, that the synchronous bug fix makes the tail, head and mark options obsolete? That would be breakage of documented behaviour and there is more to consider ...

erikleunissen added on 2020-06-08 08:02:05:
Considerations for selecting the solution to merge.

If a user has been relying on the order of execution in a sequence of commands, where one of the tasks is an idle warp event, then I expect that the other events in the sequence also involve screen updates. In that case the user has been relying on something very intricate, which may depend on the wm, and which is undocumented.

The primary example of a case where a user relies on such understanding is (y)our own test bind-35.1 in the asynchronous bug branch. As I have made clear in this ticket, I (still) have trouble understanding what exactly it is that each forced sync ([after ms], [update], [update idletasks]) in that test is waiting for. It exemplifies the card castle, as you aptly named it.

I know that you do understand the card castle (after a lot of analysis), but I think that very little other users do. And if they do, they also would need to have analysed the C code. That's not the type of user/programmer that Tcl is targetting (as a policy), and no regular programmer would want such understanding as a prerequisite/basis for programming defined behaviour anyway. 

An entirely different possibility is that some user/programmer relies on the order of execution based on the outcomes of a lot of trial and error with this house of cards that he doesn't understand. That also is not a use case that Tcl needs to target as a policy.

Personally, I wouldn't mind such use cases to break.

So I am in favor of merging the solution in the synchronous branch, issue a ***POTENTIAL INCOMPATIBILITY***, together with a short explanation of the type of incompatibility that is to be expected, so that use cases where the incompatibility applies can identify themselves.

Please note that I'm not used to operate at a policy level, and I'm sure I do not oversee the entire area, especially other related developments.

So these considerations are my 2 cents.

fvogel added on 2020-06-07 19:50:06:

About B, section "not reproducing": I also have checked again and I confirm my results. This is a bit odd.

Considerations about synchronous warping;

b. I don't have the answer to this question, I'm also wondering. Warping has been executed at idle time from the very beginning, when -warp got introduced in Tk (see [363f59223d]).

c. I also wondered about the risk of someone relying on warping at idle time. I had thought of the interaction of -warp and -when but couldn't demonstrate any issue. Based on your script I have however crafted a special thing to demnstrate a change:

    package require Tk
    wm geometry . +100+100
    pack [label .l -text "X"]
    update
    bind .l <FocusIn> {puts [winfo pointerxy .]}
    .l configure -text "XXXXXXXXXX"
    event generate .l <Motion> -warp 1 -x 10 -y 10 ; focus -force .l

With synchronous warping, this always returns the same coords (134 138 for me on Windows). The mouse coordinates are read *after* warping. But with idle warping the output is whatever coordinates the mouse had *before* warping.

So it's possible to demonstrate a difference in behavior. Whether this different behavior is better depends on the use case: in the above example the programmer is clearly expecting the synchronous behavior. To get it the programmer would have needed to use "update idletasks" before focussing the label, which is no longer needed in the synchronous case.

Note that, right now, the current tip of both bugfix branches contains a working solution with the adequate bind-35.1 test version (adapted in each branch to the selected solution). Your app and everything should be okay in either bugfix branch, that is either with bug-e3888d5820 or with bug-e3888d5820-alt1. The difference is that the former still uses an idle warping task while the latter uses synchronous warping. I'm mixed about which one should be merged: the alt1 branch perhaps seems better (less forced sync in bind-35.1 and cleaner approach, but slight risk of breaking something that relies on the fact that warping is run as an idle task). With idle warping the sequence of events to respect to make grabbing work is sort of a card castle that does not look very robust to me. I'm inclined to use the alt1 branch, with a "***POTENTIAL INCOMPATIBILITY***" warning in the release notes.

Regarding the improvement you are suggesting for bind-35.1, yes. We could even be much more specific by using:

    expr {$x1==$x2 && $y1==$y2 && $x1==[winfo rootx .top.l]+10 && $y1==[winfo rooty .top.l]+10}


erikleunissen added on 2020-06-07 19:06:33:
I see one more opportunity for improvement of the test bind-35.1.

The latest change with commit 9f7e6ca9 assumes that width and height of window decorations are at most 10 pixels:

    expr {$x1==$x2 && $y1==$y2 && $x1>=310 && $y1>=310}

What if any future fashion whim uses 20 pixels? You cannot know.

I'd suggest to play safe, and change it to:

    expr {$x1==$x2 && $y1==$y2 && $x1>=[winfo rootx .top] && $y1>=[winfo rooty .top]}

erikleunissen added on 2020-06-07 18:12:23:
A. RESULTS OF TEST bind-35.1 FOR CHECKIN 9f7e6ca9
=================================================
Test runs use the file bind.test from checkin 9f7e6ca9.
Test runs have been performed under KDE/Plasma, with the config file kwinrc that has the aurorae theme engine line (as uploaded previously).

The test results are as desired now:
- test PASSes with the fix
- test FAILs without the fix (using libtk8.6.so from Tk release 8.6.10)


B. RESPONSE TO PREVIOUS REMARKS (FV, 2020-06-06 15:08:16)
=========================================================
W.r.t.

    "My understanding is that your report is with bug-e3888d5820-alt1"

Correct!


W.r.t.

    "I'm not reproducing this. Adding update idletasks; after 50 after each event generate... line of test bind-35.1 at commit 11ecc0a8 from branch bug-e3888d5820-alt1 should have no effect (because warping is synchronous there), and this is what I'm observing."

I entirely understand what you write. I was surprised myself that restoring the commands related to idle callbacks made any difference for PASS/FAIL of the test with the fix in place.

I checked again (now with test bind-35.1 in commit 9f7e6ca9): the test FAILs again with the fix in place. Although this is strange, I believe that our differing results w.r.t. this phenomenon are not relevant anymore to the issue in this specific ticket since the unmodified test behaves as desired in both our test environments (see A above). What's your take?


GENERIC CONSIDERATIONS REGARDING SYNCHRONOUS WARPING
====================================================
A few generic considerations regarding the new synchronous approach to pointer warping in branch bug-e3888d5820-alt1:

a. this simplifies the code a lot and removes room for error.

b. it makes me wonder: was there originally a specific reason for implementing pointer warping as an idle callback other than "this is a screen update and all screen updates are collected until and executed at idle time"?

c. is it possible that the introduction of synchronous mechanics for pointer warping changes the chronological order of physical screen events, because (most/all ?) other screen updates occur at idle time? Consider this example:

	pack [label .l -text "X"]
	.l configure -text "XXXXXXXXXX"; # command A
	event generate .l <Motion> -warp 1 -x 10 -y 10; # command B

With a synchronous warping strategy, I can imagine that the chronological order of the screen updates that correspond to commands A and B, is the reverse of the order in which these commands are issued.

I investigated this specific example, and I didn't detect a reversal here. Nevertheless, I thought it wise to mention this as a (theoretical) possibility to you. Maybe you can see other cases than  my example, where it is relevant. Or maybe you can disprove this concern. On the other hand, if such reversal can happen, than we'd have a "POTENTIAL INCOMPATIBILITY".
--

fvogel added on 2020-06-06 20:14:56:

I think that now at [9f7e6ca924] in branch bug-e3888d5820-alt1 everything should be okay. The same test bind-35.1 passes in the bugfix branch with or without aurorae theme, and fails in core-8-6-branch. Besides you already confirmed that your application sees the bug fixed and that you have detected no regression.

Nevertheless, may I ask you to check once more? You seem to have a skill in finding issues that went through. Thanks!


fvogel added on 2020-06-06 15:14:42:
Note: my previous message was for my system without the aurorae thing. I'll also be trying with this theme in the kwinrc file.

fvogel added on 2020-06-06 15:08:16:

With commit [11ecc0a8] from branch bug-e3888d5820-alt1, test bind-35.1 passes without update idletasks; after 50 because warping is now synchronous in that branch (that is event generate xxx <Motion> -warp 1 does no longer return until warping happened - Previously warping happened as an idle event).

In core-8-6-branch, the same test (i.e. without update idletasks; after 50) passed because the criterion for test success was not sufficiently specific. I have improved this so that the same test now fails in core-8-6-branch and passes in bug-e3888d5820-alt1. See [f641e3b101].

Therefore, I restored them, but that makes the test FAIL with the fix in commit 11ecc0a8.

I'm not reproducing this. Adding update idletasks; after 50 after each event generate... line of test bind-35.1 at [11ecc0a8] from branch bug-e3888d5820-alt1 should have no effect (because warping is synchronous there), and this is what I'm observing.

Regarding your message on 2020-06-06 12:24:10:

- My understanding is that your report is with bug-e3888d5820-alt1 (please raise your hand if this is incorrect).
- Your testing with your large application exercising grabs shows (beyond bind-35.1) that the synchronous approach for warping works for you. Even if I was expecting this, that's cool!
- About running the full test suite, yes there are a lot of (unrelated) failures on Linux. On Debian 10 I'm seeing 68 failures, which is a bit less than you. I don't think there is anything relevant to this ticket elsewhere than in bind.test.


erikleunissen added on 2020-06-06 12:24:10:
Additional results on Linux/KDE/Plasma/aurorae:

* Ran the entire test suite. This always produces a lot of errors which are not related to this ticket. Nevertheless, here it is (only global reporting):

Tests ended at Sat Jun 06 14:13:02 CEST 2020
all.tcl:        Total   9635    Passed  8559    Skipped 962     Failed  114
Sourced 92 Test Files.
Files with failing tests: focus.test font.test frame.test grid.test listbox.test raise.test spinbox.test textDisp.test textIndex.test unixEmbed.test unixFont.test unixWm.test winfo.test wm.test

So: bind.test, grab.test and textTag.test all without errors.


* My large application, which has a GUI that uses quite some grabs, didn't show any errors or peculiarities when handling the GUI.

erikleunissen added on 2020-06-06 11:26:35:
I've taken the source files from commit 11ecc0a8 on the branch bug-e3888d5820-alt1  and run the test[*] from the build dir on my system 1 with the KDE/Plasma/aurora wm (the usual setup).

    [*] using the standard test invocation as used in this ticket

Indeed, the test included with commit 11ecc0a8 PASSes with the code in that branch. However, that test also PASSes if run against the good old Tk8.6.10 where the bug is still present. That's doesn't seem right.

At first I believed that the unexpected PASS for Tk with bug was due to the removal of all combinations of:

   update idletasks; after 50

in the new branch bug-e3888d5820-alt1 (i.e. starting with and including commit d9d8c6bf).

Therefore, I restored them, but that makes the test FAIL with the fix in commit 11ecc0a8. Apparently there's more to it than I believed.

fvogel added on 2020-06-02 22:16:53:

Yes your comment makes sense. I have now implemented another approach to fix this bug, could you please test with branch bug-e3888d5820-alt1? This branch passes bind-35.1 for me on Debian 10 with or without the aurorae thing in the kwinrc file, and also on Windows Vista and macOS Catalina. Please try the test suite, but also your large real world application for which you originally opened this ticket.


erikleunissen added on 2020-06-01 11:21:24:
I've been thinking about:

   "So now that we have understood that it's the aurorae theme that slows down KDE ..."

I'm unsure about this understanding, when considering the side-by-side diff that I posted previously[*] when I was comparing the behaviour of various window managers on system 1.

At that time, you responded with "At this point I don't know why the added "after 200" seems to trigger two additional calls to TkPointerEvent ..."

The observation that *additional* calls to TkPointerEvent() occur with aurorae, suggests that the program execution takes a different path, rather than just slowing down.
--

[*] initially on 2020-02-29 18:35:36, in section B2, but signicantly corrected with wm_diff.2.zip on 2020-03-01 16:05:32.

fvogel added on 2020-05-24 19:56:42:

Yes, this line appears to be valid. What seems apparent is that when present it slows down KDE quite a bit, to the extent the test fails. So now that we have understood that it's the aurorae theme that slows down KDE, we can simply add some time (say, after 200) in bind-35.1 to add some robustness there, and commit all this.

However I got a suggestion from Marc Culler that I would like to experiment first. Stay tuned!


erikleunissen added on 2020-05-19 19:42:20:
On my system 1, after a pristine login from a new user into KDE/Plasma, the following configuration actions in *my* graphical configuration interface:

- select Application Style in category Appearance;
- select category Window Decorations;
- click button Get New Decorations;
- select to install "Ambient-Blue-Aurorae" (or something the like?)
- click Close button
- click Apply

produce a file ~/.config/kwinrc with the following contents:

<verbatim>
[Compositing]
OpenGLIsUnsafe=false

[org.kde.kdecoration2]
BorderSize=Normal
ButtonsOnLeft=MS
ButtonsOnRight=HIAX
CloseOnDoubleClickOnMenu=false
library=org.kde.kwin.aurorae
theme=__aurorae__svg__Ambient-Blue-Aurorae
</verbatim>

Therefore, the line

<verbatim> library=org.kde.kwin.aurorae </verbatim>

appears to be current and valid.

erikleunissen added on 2020-05-18 09:26:11:
About Aurora:

I am not aware of any KDE desktop setting that I configured involving the name "aurorae"; the name is new to me. So I did an internet search. From what I saw, I conclude:
- Aurorae is the name of a "theme engine"
- It's the default KDE built in theme engine
- It's a current engine, under active maintenance.

I do remember to have configured desktop themes in the past. When I launch the graphical configuration interface on my OpenSuSE installation I recognize theme names like "Breeze", "Air", "Oxygen" and an option to "Get New Themes". I'm sure to have fiddled with them in the past. But I don't know when it ought to show up in the kwinrc config file.

You seem to question the righteousness of that line in the kwinrc config file, which is hard to say:

a. The internet search shows that "Aurorae" is a current KDE component. I expect it to play a role when configuring (current) KDE desktop aspects, especially desktop themes. But please realize that I don't know what exactly a "theme" encompasses and what not (wallpapers, window decorations, cursor themes, widget properties, ... ). 

b. Could it be a left-over of a previous configuration? Maybe, but since "Aurorae" is a current KDE component, this may be a moot point. I never had any reason to look at the config file when changing themes in the past. I'm willing to do some experimenting to find out by changing "theme" settings through the config interface and inspect whether the resulting changes to the config file encompass a:

        <verbatim>  library=org.kde.kwin.aurorae  </verbatim>


c. Finally, there appear to be related upstream issues. The following KDE ticket might be relevant:

        https://phabricator.kde.org/D26814

But I can't judge it's significance for our Tk issue.

fvogel added on 2020-05-17 20:12:40:

In the kwinrc file you (Erik) have provided, the culprit line is in the:

[org.kde.kdecoration2]
section and it says:
library=org.kde.kwin.aurorae

Without this line bind-35.1 passes.

With this line bind-35.1 fails. And adding "after 200" (actually after 50 is enough) at the end of proc waitForGrab (in bind-35.1) lets it pass again even with this line.

So what is this line for in the kwinrc file? Apparently some theme engine? This is not installed on my Debian 10 system. Is it installed on your "system 1"? Is this line perhaps a leftover from some old installation that was not properly cleaned up?


erikleunissen added on 2020-05-17 13:17:13:
Words fail to express my relief!

I'm so curious whether the test fails in the same way as it does with my "system 1" or whether you happened to find yet another/different issue.

Erik.
--

fvogel added on 2020-05-16 14:37:06:
@erik: Guess what? I have upgraded my Linux Debian 8 to the up-to-date Debian 10.

(Btw : so far I wrote my system was Debian 9 but it really was Debian 8).

I have:

- created a new account xxx under Debian 10
- logged in that account with KDE/Plasma desktop (to create all config files)
  --> there IS a /home/xxx/.config/kwinrc file present.
- cloned the Tcl and Tk repositories
- built Tcl and Tk (in the bugfix branch)
- run bind-35.1
  --> bind-35.1 PASSES

Then I have replaced the originally generated /home/xxx/.config/kwinrc by yours that you have attached to the present ticket, and I have run bind-35.1 again
  --> bind-35.1 FAILS!

So now I can reproduce failure of bind-35.1 with your kwinrc file. This will help me a lot in understanding how to make this test even more robust to avoid failure even when some KDE/Plasma parameter is set to some unusual value, which is what your kwinrc file must be doing.

fvogel added on 2020-05-16 07:48:22:

@marc: many thanks for these tests. You are confirming what I have seen on my side.

The code I removed in commit [2d8804d0] was masking on macOs the issue reported in this ticket. I was wondering why the mac behaved in a different way. Removing that event injection code makes test bind-35.1 fail (before the fix [3bf789ff]) on the mac like the other platforms. And then commit [3bf789ff] is enough to fix all platforms.

Where we are now, in a nutshell: the bug is fixed, on all platforms. We have a new test bind-35.1 confirming this. For all plaforms/WM that we could put our hands on, or get the tests results from helpful people, we have reports telling us that this test passes. Except one system sitting on Erik's desk, "system 1", on which the test is failing. This system 1 runs KDE/Plasma.


marc_culler (claiming to be Marc Culler) added on 2020-05-13 19:18:59:
I have done some testing on macOS.  The question I was answering was whether the
change in commit [2d8804d0] was correct.  The change removed code that injected
a MouseMoved NSEvent into the system event queue.  There was a comment indicating
that it was not clear whether the injection was necessary given that it appeared
that event generate $w <Motion> -warp 1 ... would cause two identical <Motion>
events to appear on the Tk event queue.

My tests indicate that removing the code was correct.  With that code gone the
new test bind-35.1 passes in the bugfix branch and also in the tip of
core-8-6-branch.  Also, setting -warp 1 in the event generate command causes
only one <Motion> event to be queued.  The event injection was superfluous.

This makes sense to me.  One <Motion> event was being created by the event
generate command.  Another <Motion> event was being created by the mouse event
handling code when the injected NSEvent was processed.  Both XEvents are
reporting the same mouse position.  Only one of those XEvents is needed.

fvogel added on 2020-04-19 16:20:58:

One way of finding that out might be to change various KDE/Plasma settings through its configuration graphical user interface, and see in which file they are saved.

I did this already. The kwinrc file actually used is in /home/xxx/.kde/share/config

Putting your file there does not allow to trigger a failure in bind-35.1


erikleunissen added on 2020-04-19 16:09:06:
W.r.t. "Not sure this kwinrc file is read by my system."

I don't believe we can proceed in a meaningful way along this path without being sure of this. One way of finding that out might be to change various KDE/Plasma settings through its configuration graphical user interface, and see in which file they are saved.

fvogel added on 2020-04-18 14:15:05:
In addition: there is a kwinrc file in /home/xxx/.kde/share/config

If I replace it by your file, bind-35.1 still passes for me.

fvogel added on 2020-04-18 14:09:14:
I have:

- created a new account xxx under Debian 9
- logged in that account with KDE/Plasma desktop (to create all config files)

  --> there is no such /home/xxx/.config/kwinrc file for me. Could this depend on the KDE version perhaps?

- cloned the Tcl and Tk repositories
- built Tcl and Tk (in the bugfix branch)
- run bind-35.1

  --> bind-35.1 passes (as expected)

Then I have placed your kwinrc file in /home/xxx/.config/ then logged in again my xxx session, and ran bind-35.1 again --> it still passes. Not sure this kwinrc file is read by my system.

erikleunissen added on 2020-04-16 19:32:21:
Now fixed. The upload facility doesn't warn when a user makes a stupid mistake like attempting (3x) to upload a read-only file owned by another user. Mistake  corrected.

fvogel added on 2020-04-16 19:10:22:
> I attached that configuration file 

Really? I don't think I see it in this ticket, where is it?

erikleunissen added on 2020-04-16 16:23:21:
I'm on to something now:

An update:

1. The test passes consistently on my exceptional "system 1" from a fresh user account, using the same KDE/Plasma desktop as used by my own account, just fresh with default configuration files. (This was also communicated by email a few days ago.)

2. Since that experience, I've been trying to make it the test fail again on the fresh user account, by systematically replacing KDE configuration files by the corresponding copies on my account where the test fails consistently.

I was succesful quickly:

The test fails when I replace the following configuration file with the copy from my account:

    ~/.config/kwinrc

I attached that configuration file and ope that you are able to make the test fail on your computer with this file. That would enable you to do further debugging from your own computer instead of needing a cumbersome remote access to mine. (The setup on my computer is still available in the case that this configuration file doesn't make the test fail on your computer, but I hope we don't need it.)

Procedure:

- create a fresh user account "xxx" on the system with your KDE/Plasma desktop (Debian 9, I believe).
- when not yet logged into account "xxx", become root (from your regular account)
- replace /home/xxx/.config/kwinrc with the attached file
- maybe logout from your current account, I did that
- log in to the new account xxx
- run the test bind-35.1

I'm very curious after the result.

erikleunissen added on 2020-04-16 16:21:32:
.. and to bll!

erikleunissen added on 2020-04-16 16:21:04:
First a thank you to Paul and pointsman from me too. Highly appreciated.

pointsman added on 2020-04-06 11:21:41:
Debian 9.12 64-bit fvwm2 OK

fvogel added on 2020-04-05 19:15:26:

Paul: thanks. This 'const' was introduced in [32b621b3] but was reverted soon after in [fc470fc7]. This revert is not yet merged into our bugfix branch, sorry for that.

Thanks to all who already contributed their testing results! That's great, that said we're still looking at a system where bind-35.1 fails in branch bug-e3888d5820, so please continue contributing.


obermeier added on 2020-04-05 18:57:39:
Note, that I got an error message in the first attempt on Suse Leap 15.1:

unix/tkUnixFont.c:869:13: error: assignment of read-only location
  *familyEnd = '\0';

Had to change:
  const char * familyEnd;
to: 
  char * familyEnd;

obermeier added on 2020-04-05 18:54:14:
Here are the results of the "BAWT test suite" jury:

SUSE Leap     15.1  64 KDE    OK
SUSE Leap     42.3  64 KDE    OK
OpenSUSE      13.1  64 KDE    OK
Ubuntu        19.10 64 Gnome  OK
Debian Buster 10.2  64 KDE    OK
Debian Jessie  8.11 64 KDE    OK
Windows       10    64 gcc    OK

bll added on 2020-04-05 14:08:43:
MX Linux 19.1 / 64 / xfce : ok
Ubuntu 16.04 / 64 / mate : ok
OpenSUSE 15.1 / 64 / gnome : ok
KUbuntu 18.04 / 64 / kde : ok
Linux Mint 19 / 64 / mate : ok

fvogel added on 2020-04-04 19:03:39:
Sure, we're interested in any platform. Hre is what I'm using on all three platforms:

Linux:
cd /path/to/tk/unix
make test TESTFLAGS="-file bind.test -match bind-35.1 -verbose beps"

Windows with MSVC:
cd path\to\tk\win
nmake -f makefile.vc test TESTFLAGS="-file bind.test -match bind-35.1 -verbose beps"

macOS:
cd /path/to/tk
make -C ./macosx test TESTFLAGS="-file bind.test -match bind-35.1 -verbose beps"

erikleunissen added on 2020-04-04 15:20:19:
I attached a new invitation text.

One more question: my intention was to invite users on all platforms (linux, mac, windows, ..).

- Do you agree to invite all kind of platforms?
- I believe that the test invocation ("make test ... ") is not valid for  all platforms. Is it?
- Do you know the correct test invocations for mac and windows?

Erik.

fvogel added on 2020-04-04 12:52:20:
Thanks ! A few answers and comments:

Perhaps first clearly state that the bug itself is believed to be fixed in the bugfix branch. The associated associated non-regression test succeeds everywhere we know, except on a single (so far) system, and we're looking at finding why.

> - A desktop computer (for Linux: the more window managers you can run the better)

Any computer will do, not just desktops but also laptops or anything, right?

> - Tcl8.6.10 installed (*** FRANCOIS: IS AN OLDER RELEASE USEFUL? ***)

Yes (see below).

> *** ALSO OLDER? ***

We don't care. Any Tcl version against which Tk will compile is OK.

> - *** MORE, OR MORE SPECIFIC? WE DON'T KNOW WHAT WE'RE LOOKING FOR ***)

Don't know at this point. Anything looking relevant to the reporter perhaps.


Overall, looks good, thanks! One more point: we should request feedback directly in the ticket, so that all information is gathered at a single place.

erikleunissen added on 2020-04-04 09:33:36:
Attached invitation: file "CallForHelp.txt".

erikleunissen added on 2020-04-04 09:32:17:
I've composed an invitation to be posted on tcl-core, for participating in this hunt. Before posting it, I'd appreciate your review. Could you please pay particular attention to the passages marked with ***?

Thanks,
Erik.
--

fvogel added on 2020-03-31 18:13:41:
Yes this is a good idea. Prompting the Tcl core list comes to mind first, as this list has quite many people able to do such a test. Feel free to post there?

erikleunissen added on 2020-03-31 15:44:06:
OK. Clear enough. I will resume my investigation of "the darned system 1".

Another thought that occurred to me: I believe that it would be good to have a better insight in the extent to which the behaviour of my "system 1" is exceptional. That insight could result from a final outcome of my investigation of course. But what about calling out for assistance? We could target people in the Tcl community who don't object to compiling Tk themselves[*], and ask them to run test bind-35.1 from the bug fix branch and report back the results.

fvogel added on 2020-03-30 11:47:45:
We can merge at any time. The right time for merging is somewhere between now and the next release of Tk (no, I don't known when this will happen).

If you can investigate your system 1 further within a reasonable timeframe, then I'd prefer we understand the full story before merging. That could avoid some surprises that are always possible: we're obviously still not having a really complete understanding since at least one system has an issue with the test added as non-regression of the fix for this ticket.

If you cannot devote time to investigation of this system 1, then I would merge and let the Damocles sword hanging above us. I can see no better plans.

erikleunissen added on 2020-03-30 07:49:32:
Hi Francois,

No progress to mention. I'm afraid I may have got a wrong impression about your interest/priority in "the exceptional system 1".

From the post on [2020-03-09 20:16:20] I got the impression that you were preparing to merge the fix (+ test of course) at the current state of this ticket. In response to that, I've given you my personal view on the relative priority of the exceptional behaviour of system 1. So I was expecting (and agreeing in advance) that the fix in its current state be merged into the main development branches, regardless the outcome of my investigation into "system 1". As a result of that, my attention and time strayed to other issues.

I'm certainly willing to give the issue priority again, and continue the pursuit if desired. But I'm also fine with separating (in whatever shape) the pursuit of the exceptional issue from a merging action (soon).

Right now I'm confused about the plan, for which I tend to follow your direction.

fvogel added on 2020-03-29 16:47:27:
Hello Erik,
Any progress you could share, perhaps?
I'm still curious about your "system 1".

fvogel added on 2020-03-10 07:18:08:

[afer ms] is implemented here, you can study what it does exactly in the source code. It basically just does Tcl_Sleep() until the given time is elapsed. Tcl_Sleep() is implemented differently on each platform (in tclUnixevent.c, tclMacNotify.c, tclWinNotify.c - in the *Tcl* repository).


erikleunissen added on 2020-03-09 22:26:22:
Hi François,

I didn't make any (tangible) progress yet. I had lack of time in the last week, combined with a little lack of health (my mind isn't clear because of a cold).  And I think it's very inefficient for you to help in this situation. 

Personally, I'd be in favour of just merging the fix into the main development branches. I don't mind that the mysterious behaviour of "system 1" remains unsolved because the odd behaviour with the failing test on my "system 1" is probably:
a. exceptional (as I stated previously). And from your words, I believe you believe so too.
b. not to be attributed to Tk (including the fix) anyway.

If the ticket is closed, I can continue pursuing the cause for the weird behaviour on my own accord. Given the above, it may not be worth waiting for though (for the sake of Tk).

Another matter, regarding the usage of [after ms]. There is still lack of understanding on my side, in general and especially in the Tk tests suite. It's not about what the manual says that [after ms] does: "it makes the command sleep". It's that I'm unclear about:

- (in general) when the command sleeps, and doesn't respond to events, does that mean that the process (or thread) stops processing entirely?
- (specifically in the Tk test suite, and test bind-35.1) what kind of "result" or "state" is it that [after ms] is waiting for? This is puzzling to me, especially if the process has actually stopped doing any processing itself. (For something *outside* the process maybe?) I'd appreciate very much if you clarified that a bit in generic terms; no need to go into specifics for every individual case. If I am going to continue the search for the odd behaviour of "system 1", I think I need to understand this better.

Thanks in advance,
Erik.
--

fvogel added on 2020-03-09 20:16:20:
Hi Erik, did you make any progress on this? Can I further help?

I'd like to merge the fix, and the fact your "system 1" is the only one we know that shows bind-35.1 test failure is really puzzling. I'm confident I have fixed the original problem, and that only the non-regression test is failing, and it is only failing on one of your systems. Either we put further effort there to find the root cause for this failure or we can decide to leave this unsolved.

Thanks for your feedback.

fvogel added on 2020-03-01 21:46:55:

From the man page:

after ms
    Ms must be an integer giving a time in milliseconds. The command sleeps for ms milliseconds and then returns. While the command is sleeping the application does not respond to events. 

And indeed it seems that this is really exactly what it is doing. You can check it by instrumenting as follows (in the *Tcl* repository this time):

Index: generic/tclTimer.c
==================================================================
--- generic/tclTimer.c
+++ generic/tclTimer.c
@@ -1016,10 +1016,11 @@
 
     Tcl_Time endTime, now;
     Tcl_WideInt diff;
 
     Tcl_GetTime(&now);
+printf("AfterDelay: START, now waiting for %d ms\n", ms); fflush(stdout);
     endTime = now;
     endTime.sec += (long)(ms / 1000);
     endTime.usec += ((int)(ms % 1000)) * 1000;
     if (endTime.usec >= 1000000) {
 	endTime.sec++;
@@ -1089,10 +1090,11 @@
 		return TCL_ERROR;
 	    }
 	}
         Tcl_GetTime(&now);
     } while (TCL_TIME_BEFORE(now, endTime));
+printf("AfterDelay: END of waiting time (%d ms)\n",ms); fflush(stdout);
     return TCL_OK;
 }
 
 /*
  *----------------------------------------------------------------------

With all instrumentation in, the (extract of the) output is as follows, showing that "after 200" really sleeps for 200 ms and returns only after this delay without processing events:

[...]
TkPointerEvent ENTERED
  eventPtr->type is 8
  EnterNotify || LeaveNotify, this is a GENERATED_GRAB_EVENT_MAGIC
*** waitForGrab iterations: 1
*** after 200
AfterDelay: START, now waiting for 200 ms
AfterDelay: END of waiting time (200 ms)
*** event generate .top.l <Motion> -warp 1 -x 10 -y 10
[...]

At this point I don't know why the added "after 200" seems to trigger two additional calls to TkPointerEvent (actually not only this: one of the common calls has a different dispPtr->serverWinPtr set: without "after 200", it is ".", whereas with "after 200" it is ".top").

I have tried to further refine the instrumentation in bind.test to trace execution also when leaving the event, grab, after, and update commands, this helps a bit in seeing the sequence of what is happening but nevertheless I don't have a clue about why your system 1 behaves this way.


erikleunissen added on 2020-03-01 16:05:32:
I've improved the interpretability as follows:

- The instrumentation in TkPointerEvent() now reports the name of the window instead of a memory location, using Tk_PathName(). This removes a nasty suggestion caused by the substitution of memory locations after the test run.
- The instrumentation of bind-35.1 additionally prints the windows and their mapping state.

A new wm_diff.2.zip has been attached, with updated output and scripts.
The interpretation is facilitated ny seeing the window path names in the output, but I still don't grasp what [after ms] exactly does.

erikleunissen added on 2020-02-29 18:35:36:
These are first results of a further investigation w.r.t. (exceptional?) test failure on KDE/Plasma. All results below presume that:

- the fix is in place: the latest check-in on the bug-fix branch [5fa8b7e7] was used.
- test runs take place on ever the same computer system with KDE/Plasma installed where the (exceptional?) test failure was observed (i.e. my system 1, as reported previously), and where the test can be made to pass by inserting an extra [after 200], also as reported earlier.
- test runs use a tkGrab.c, with instrumentation, slighlty modified after yours (see below, and attached).
- test runs use an instrumented test bind-35.1 (see below and attached).

All instrumented source files, output files, scripts for substituting output and for generating diffs, that were used in this analysis have been attached as wm_diff.zip.
If you want to experience/see how it runs you can do that as follows:
- unzip anywhere
- symlink the script "run_test_35.1_from_build.tcl" into a Tk build dir
- run that script from there (maybe after a little tweaking, Debian may do things differently especially w.r.t. naming/registering window managers)


A. Code snippet w.r.t. numTries loop in Tk_Grab()
==================================================
I instrumented the code snippet that you indicated (see the extra instrumentation line in that code snippet in tkGrab.c.instr, attached). It isn't run once during execution of test bind-35.1.


B. Comparison of behaviour with different window managers
=========================================================

Setup
-----
Three WM's have been used: KDE/Plasma, IceWM and TWM, and the following four test run cases have been construed with them:
  1. KDE/Plasma, regular test script (test fails)
  2. KDE/Plasma with insertion of an extra [after 200] as described earlier (test passes)
  3. IceWM, regular test script (test passes)
  4. TWM, regular test script (test passes)

Procedure
---------
Target of the experiment was the instrumentation output from both TkPointerEvent() and bind-35.1. Since you came up with the instrumentation of TkPointerEvent(), I thought I'd best start with this approach. Extra instrumentation has been adde3d to test bind-35.1 by tracing the execution of the commands after, update, event and grab. Because I wanted to see the correct timely order of each printf() in TkPointerEvent() and each "puts" in bind-35.1, I added a fflush(stdout) to each printf() in TkPointerEvent() and a "flush stdout" to each puts in bind-35.1. For the sake of easy comparison, the hex values for winPtr in the instrumentation output were systematically substituted, in order of appearance in the output, by the symbols w1, w2 etc ... respectively, as to remove the irrelevance w.r.t. memory locations that differ between test runs.

B1. Comparing passed test runs (2, 3 and 4)
-------------------------------------------
A (unified) diff was made for the (substituted) output for the following combinations of test runs:
2 vs. 3
2 vs. 4
3 vs. 4

Each of these three diff outputs didn't show a single difference w.r.t. the instrumentation output (except for an extra [after 200] for case 2 in combination 2/3 and 2/4 if that command is being traced in the instrumented test bind-35.1). This shows that all window managers behave exactly the same as far as indicated by the instrumentation output.

My conclusion: the extra [after 200] in case 2 functions as a correction w.r.t. the undesired behaviour on my system 1 with KDE/Plasma.


B2. Comparing failed with passed test runs (2 and 1)
----------------------------------------------------
(B.t.w. from the previous exercise it follows that comparing 2 to 1 is equivalent to comparing 3 to 2 and 4 to 2).
For this comparison a produced a side by side diff (see attached). It shows the difference that the extra [after 200] makes for the instrumentation output in chronologic order. From that diff, it appears that two calls to TkPointerEvent() are being skipped by KDE/Plasma without the extra [after 200]. But it happens at a place in the program flow where I didn't expect it.

This analysis has produced a clear difference w.r.t. the program flow inside the Tk code that the extra [after 200] causes. Well, what's next? 

I'm having trouble interpreting this difference. A rather fundamental reason for that is that I'm confused about exactly what a [after ms] command accomplishes. Before starting with this ticket I understood that [after ms] blocks/suspends all program flow in the process. That appears not to be true when seeing its effect in bind-35.1. Therefore, I am hoping very much that you are able to indicate the significance of this difference, before me taking a next step.

If you can see directions for a next step (especially when more promising than my current approach) please let me know.

erikleunissen added on 2020-02-18 22:12:46:
Oh (our messages crossed), yes I will investigate the for loop that you indicated before starting with the KDE installation, but probably after a week or so since compiling is awkward now.

erikleunissen added on 2020-02-18 22:04:04:
Answering your questions:

- Indeed, the extra 200 ms is only needed in one place, the same place as indicated already with my post on 2020-02-01 10:56:16, section B6.

- Personally I'm not convinced that it's useful to know that the value of dispPtr->grabWinPtr changes within 10 ms after [grab $w] has been completed, if the actual status on screen lags behind with tens or hundreds of ms (for example). I don't know whether time lags of such degree are realistic, or whether the entire scenario is even possible, but I would like to know very much. I'm just being very wary of effort that is not directed or even wasted.

- Yes, I intend to further investigate the reason for the exceptional need for 200 ms extra on my system 1 when I get back (in a week or so). That is, if I can devise a systematic approach that doesn't compromise the working system that I need for my day to day development work; and the latter is a challenge. I do have a target though, and that's the KDE/Plasma installation/configuration.

fvogel added on 2020-02-18 21:10:16:

Looking at the source code once more I have perhaps spotted something.

Could you please instrument this loop and tell me whether this code is executed for you on system 1 when bind-35.1 is exercised?

It should not be exercised (it is not for me, neither on Vista nor on Debian 8), because this codepath is only run when a global grab is requested (which is not the case of bind-35.1), or when a mouse button is down when a local grab is requested (which should neither be the case in bind-35.1 but perhaps there is a possibility here). This loop contains a Tcl_Sleep(100) that might explain your delay. So please test this and report.


fvogel added on 2020-02-18 13:20:40:
Thanks for this last post that indeed removes some confusion. 

And you don't need this "after 200" also after the third call to waitForGrab in [0b310b17] ?

About your question regarding "being registered in a C variable" corresponding to "what's actually in effect on the screen": we don't really care about what is in the screen, what we really want is to know whether Tk believes that the window is grabbed (or released). And checking dispPtr->grabWinPtr is the way to know this as far as I understand the source code.

I'm out of ideas about why your system 1 needs this 200 ms additional delay, and it's even harder to guess remotely without seeing the system. Can you perhaps investigate this on your side?

erikleunissen added on 2020-02-18 13:05:45:
Where I wrote:

"When inserting an extra [after 200] (right after the first call to [waitForGrab]), the test passes in both cases."

I meant: 

right after the first call to [waitForGrab] with [0a5a92c7], and right after the second call to [waitForGrab] with [0b310b17]

erikleunissen added on 2020-02-18 11:44:56:
Counting iterations in [waitForGrab]
====================================
I inserted a puts like this (example [0b310b17], equivalent in [0a5a92c7]):

    proc waitForGrab {type grabWin} {
    # process events while $grabWin is not grabbed ($type == "grabbed"),
    # or while $grabWin is not released ($type == "released"), but don't
    # spend more than 5 seconds doing this
        set i 0
        while {![testgrab $type $grabWin] && $i < 500} {
            after 10
            update
            incr i
        }
        puts |$i|
    }


Test results with iteration count (KDE/Plasma system 1)
=======================================================
bind.test [0a5a92c7]:
|0|
|0|
The test fails as reported previously.

bind.test [0b310b17] (including compiled [testgrab]):
|0|
|1|
|1|
The test still fails  :-(  :-(

When inserting an extra [after 200] (right after the first call to [waitForGrab]), the test passes in both cases.

My reasoning about [0b310b17]
=============================
[waitForGrab] now reports, based on [testgrab], that 1 iteration (= 10 ms) is enough for a grab change to be registered in dispPtr->grabWinPtr. Are you sure that "being registered in a C variable" corresponds to "what's actually in effect on the screen"? I cannot judge that very well. If it really does, then we need to conclude that the extra 200 ms on my system 1 are needed for something else than a change in grab being effective on the screen.

B.t.w.
I'll be away from my development computers for the coming week, having only  remote access to them (very awkward). I do check my email and have Fossil/internet access, but I expect that I can't do detailed/heavy-duty investigation during that period.

fvogel added on 2020-02-17 22:14:35:

Ah yes, correct, [grab current] returns dispPtr->eventualGrabWinPtr whereas I'm interested in dispPtr->grabWinPtr. See the difference between these two here.

I have implemented something different, please check [0b310b17a7] and report whether test bind-35.1 now passes for you on your slow system 1.

Note that with my Windows Vista system at least, the while loop in proc waitForGrab runs once only (insert a puts in that loop to check this), meaning that the grab is effective (the corresponding events are processed) in less than 10 ms. I find quite strange that the same thing needs up to 200 ms on your system 1. Just for the sake of curiosity, could you add a puts in that loop and report how many times the loop executes on this system 1?


erikleunissen added on 2020-02-17 15:58:06:
Another insight gained:

The insight concerns the test bind-35.1 as of checkin [4e5c1952] with the fix in place, on my KDE/Plasma system on which I carried out all Linux inspection thus far. This test does not pass, but it does pass if I let it wait 200 ms instead of 50 ms (see previous message, 2020-02-01 10:56:16, B6).

Additional observations indicate that this slow behaviour of my OpenSuSE/Linux KDE/Plasma system is exceptional.


Specifics
=========
The test doen't pass on my OpenSuSE Linux system with KDE/Plasma (where it needs 200 ms instead of 50 ms). However, the test did pass with other desktops and window managers on the very same Linux system. Important note: this excludes the hardware or the X-server as the cause.

Also, you reported that you didn't reproduce this behaviour on your KDE/Plasma system on Debian 9.

In order to have a broader view of the variability of timings, I did an additional check with another, older computer which I rarely use, but which also has OpenSUSE/Linux installed with a KDE/Plasma desktop. For the sake of easy comparison, let's call this older system "system 2" and the system used thus far for my Linux/KDE/Plasma reports "system 1".

The CPU of system 2 is less performant and the installed software versions are older. Nevertheless, the test (bind-35.1) passes there. Therefore, I believe that the slow performance of KDE/Plasma system 1 is due to the installed KDE/Plasma version. Possibly relevant version information for system 1:
- KDE Plasma Version: 5.12.8
- KDE Frameworks Version: 5.45.0
- Qt Version: 5.9.4

Overlooking all results on all KDE/Plasma systems thus far, I think this indicates that the slow behaviour of my system 1 is exceptional.

Implications ?

erikleunissen added on 2020-02-17 15:54:23:
W.r.t. test bind-35.1 at [0a5a92c7]:

Unfortunately, this doesn't work as intended. I'm sorry.

I do understand your intention from the code in [waitForGrab]. However, it appears that [grab current] doesn't necessarily correspond with what's actually in effect on the screen.

The list returned by a [grab current] always holds the windows that were passesd as a parameter to [grab $w], and always lacks the windows passed as a parameter to [grab release $w]. It doesn't matter how soon after [grab $w] (or after [grab release $w]) [grab current] was issued. In other words, given the following two command sequences (note: with no time/waiting between the commands in a sequence):

    grab $w; expr {$w ni [grab current]}

    grab release $w; expr {$w in [grab current]}

the last [expr] command in each sequence always returns false.

Therefore, I believe that [grab current] merely returns what has been registered by a previously evaluated [grab $w] or [grab release $w].

Hence, the command [waitForGrab] doesn't wait as intended (in both cases in test bind-35.1), because the while condition is guaranteed to evaluate to false, already at the very first iteration. The test bind-35.1 still fails under OpenSuSE-linux KDE/Plasma with the fix in place.

fvogel added on 2020-02-16 20:36:14:

I have committed ([ab0f1a27]) a change in the test: instead of waiting for an arbitrary amount of time, which might be too short, I'm now waiting for the [grab current] command to return the prerequisite the test expects before going on.

bind-35.1 should not be unreliable anymore. Please confirm the behavior you observe at the tip of the bugfix branch, i.e. at [0a5a92c7].


erikleunissen added on 2020-02-04 20:34:42:
Uhmm: "heterogeneous" ought be "homogeneous".

erikleunissen added on 2020-02-02 23:08:57:
You write:

1. "People running the test suite are not so numerous in my view."

Aha! If you mean it runs mostly on machines of developers, maintainers or Tcl core members, then that's a vastly more heterogeneous group than the Tcl users (and the hardware in use). This makes the arbitrary nature of 200 ms less arbitrary, less problematic.

2. "but this is a bit of an overkill in my opinion"
I respect your opinion very much.

3. "but this ticket already took me unreasonable amounts of time, so perhaps we could decide than waiting for 200ms is Just Good Enough (TM) in the test suite, keeping in mind that the bug really is fixed"

- Again, I respect the argumentation.
- I believe that the question "Is it good enough", asks for a reference, a more general answer than just for this ticket, a policy if you will. Isn't that a matter for the Tcl core team? In any case, I don't consider myself qualified enough to make that call.

In absence of such a generic guideline/policy/standard, here's my personal opinion:

1. I follow your suggestion at 3, for all practical purposes, given the time we have in our lives, and by all means! I also want to proceed with other matters (such as doing the laundry ... ;-) ). I wouldn't use less than 200 ms though, even considering 1., because my system isn't the slowest.

2. Maybe document the arbitrary nature of this value with a small comment in the test for future reference.

fvogel added on 2020-02-02 20:30:14:
> I'm really unsure about the value of 200 ms in the realm of systems
> around the world running the Tk test suite.

People running the test suite are not so numerous in my view.

It's hard to guess what waiting time is enough, it might depend on the hardware, on the CPU use, on the OS version, or whatever.

Of course I would be more satisfied with a handshaking mechanism (some signal telling the test suite that the grab actually is in effect), but I'm not aware of such a signal. We can probably create such mechanism, Tk does it already in other areas of the test suite (it creates commands that are used by the test suite only), but this is a bit of an overkill in my opinion. I'll nevertheless dig a bit more around this, but this ticket already took me unreasonable amounts of time, so perhaps we could decide than waiting for 200ms is Just Good Enough (TM) in the test suite, keeping in mind that the bug really is fixed and that we're now only discussing reliability of the associated non-regression test.

erikleunissen added on 2020-02-02 17:51:21:
Answering your questions:

    a. does it fix the original issue you had found with your large program? 

==> yes

    b. does this large program work without new issues/regressions? 

==> yes

I understand the answer is a double yes, which then means that we now only have issues with the test but not with the fix itself. This is the really important thing, so please confirm.

==> I confirm.

The grab command is queueing events (I have added comments about this in test bind-35.1). These events have to be serviced(the grab must be really in effect) before going on with the rest of the test commands, that's the reason for the "update ; after 50 ; update" in bind-35.1. Now if your system needs more time we can increase up to "after 200", no problem.

==> My concern is the variety of systems around the world running the Tk test suite, not my own system. Is there any standard, policy or even guideline for deciding about timing issues as these? I expect that a value of 200 or even 500 ms will lead to less (reports of) test failures of the Tk test suite around the world than 50 ms.

(and i.c.m. stands for...? ice-cream man? perhaps rather "in combinatie met", in Dutch?)

==> :-) Yes, you guessed right about the Dutch abbreviation. (Maybe, then, you already know) it means "in combination with".


To conclude this "A" topic: You did all tests in "A" with [6c21b2cc]. However at the current tip ([4e5c1952]) of the bugfix branch, you do not see any triggering of hot spots, right?

==> Correct.

B. Targeted at the result of test 35.1

Your results show that on your system the unexpected failure of bind-35.1 at [3bf789ff] is independent from the hot spots (demonstrated by B5), and that your system (what Linux is this exactly again?)

==> It's openSuSE 15.0

needs more time than mine for the grab to really take effect (demonstrated by B6). Okay, fine, let's commit the increased time if this is needed on your system to properly process the events queued by the grab command, and we will be done with this ticket.

==> I'm really unsure about the value of 200 ms in the realm of systems around the world running the Tk test suite. I expect that your insight regarding this matter is better than mine. In the end, I respectfully let you be the judge about what's a wise choice in this respect.

fvogel added on 2020-02-02 14:18:31:

First, please let's go back to basics: Forget about the test suite for a moment. Considering the bugfix branch at its current tip ([4e5c1952]):

a. does it fix the original issue you had found with your large program?
b. does this large program work without new issues/regressions?

I understand the answer is a double yes, which then means that we now only have issues with the test but not with the fix itself. This is the really important thing, so please confirm.

Now some detailed answers to your observations with the test suite:

> The fix of checkin [3bf789ff] apparently also fixed some hot spot triggering. This is good, but I don't understand why/how.

[3bf789ff] changes the order in which events are serviced when warping. The behavior depends on this order, on the configured timings (read below), and on the size of the hot spot. Your observations do not look incompatible with all this.

> Something unrelated to hot spot triggering makes test 35.1 fail with tk-after-fix, something sensitive to timing. (See especially B6. i.c.m. with B2->B5 below).

The grab command is queueing events (I have added comments about this in test bind-35.1). These events have to be serviced(the grab must be really in effect) before going on with the rest of the test commands, that's the reason for the "update ; after 50 ; update" in bind-35.1. Now if your system needs more time we can increase up to "after 200", no problem.

(and i.c.m. stands for...? ice-cream man? perhaps rather "in combinatie met", in Dutch?)

> A. Targeted at understanding triggering of hot spot at (0,0)

I'm sorry, I'm running on Linux Debian 9 with KDE/Plasma and I didn't observe the behavior you describe. At [6c21b2cc] I ran bind-35.1 more than 50 times without triggering the hot spot a single time. I don't know exactly why they trigger on your system, but I suspect this is due to different settings from the ones I have.

You have to realize that the KDE/Plasma "Screen Edges" feature has two timing values that the user can adjust: "Activation delay" (default is 150 ms) and "Reactivation delay" (default is 350 ms). Activation delay is the amount of time required for the mouse cursor to be pushed against the edge of the screen before the action is triggered. Reactivation delay is the amount of time required after triggering an action until the next trigger can occur. The activation delay definitely seems to play a role in your testing, consistently with your result A5. However since you noted your result A3 and since [4e5c19528b] is already committed, there is no need/reason to add this change A5.

Basically: the hot spot triggers if the mouse pointer is close enough to it, and if it stays in that area for long enough. The "close enough" size doesn't appear to be configurable, but the "long enough" time is configurable.

To conclude this "A" topic: You did all tests in "A" with [6c21b2cc]. However at the current tip ([4e5c1952]) of the bugfix branch, you do not see any triggering of hot spots, right?

> B. Targeted at the result of test 35.1

Your results show that on your system the unexpected failure of bind-35.1 at [3bf789ff] is independent from the hot spots (demonstrated by B5), and that your system (what Linux is this exactly again?) needs more time than mine for the grab to really take effect (demonstrated by B6). Okay, fine, let's commit the increased time if this is needed on your system to properly process the events queued by the grab command, and we will be done with this ticket.


erikleunissen added on 2020-02-01 10:57:29:
Uhmm, I meant:

"In a previous post, I coupled these two and assumed that A. were the cause for B."

erikleunissen added on 2020-02-01 10:56:16:
Recap: we have the following undesired/unexplained situation, on KDE/Plasma only:

A. interference with a hot spot, but only wit tk-before-fix (checkin 6c21b2cc)
B. the test bind-35.1 does not pass with tk-after-fix (checkin 3bf789ff)

In a previous post, I coupled these two and assumed that 1. were the cause for 2.

However, after further investigation, I'm inclined to conclude that the two are unrelated as follows:
- The fix of checkin 3bf789ff apparently also fixed some hot spot triggering. This is good, but I don't understand why/how.
- Something unrelated to hot spot triggering makes test 35.1 fail with tk-after-fix, something sensitive to timing. (See especially B6. i.c.m. with B2->B5 below).

The results of my investigation are appended below.
I'm very curious what you make of this.

Erik.
--

First off: all tests have been done (from inside a tcl script) using the same invocation as before:

	exec ./tktest (the/location/of/)all.tcl -geometry +0+0 -file bind.test -match bind-35.1 -verbose beps > $outFile


A. Targeted at understanding triggering of hot spot at (0,0)
============================================================
The triggering of the hot spot with test bind-35.1 happens only with tk-before-fix, not with tk-after-fix. This is nothing new. Relevance: this may interfere with the test FAILING as expected if the grab issue exists.
Tests under A. were done with tk-before-fix (checkin 6c21b2cc).

1. invoking ./tktest while only selecting test bind-34.3 (just before 35.1) for the test run, makes the hot spot triggering go away.

2. changing the warp coordinates to (10,10) does not change the triggering.

3. changing the warp coordinates to (50,50), i.e. your commit [4e5c19528b], DOES make the triggering go way.

4. triggering also depends on the mouse pointer position *before* invoking the test run. This is related to toplevel windows other than the one specifically created for test 35.1 which are also mapped when bind-35.1 is run (selectively). If the mouse pointer happens to be inside the root window "." (with title "tktest") before invoking the test run, then the hot spot is NOT being triggered, otherwise it IS being triggered.

5. the following change to the test script (as it was before checkin 4e5c1952, i.e. the warp coordinates for the top left screen corner are still (1,1) ) makes the triggering go away (regardless point 4.):

--- bind.test.orig	2020-01-19 15:41:28.000000000 +0100
+++ bind.test.new	2020-02-01 10:46:55.000000000 +0100
@@ -6690,6 +6690,7 @@
 } -result {0 0 ok ok}
 
 test bind-35.1 {pointer warp with grab on master, bug [e3888d5820]} -setup {
+	after 500
     event generate {} <Motion> -warp 1 -x 1 -y 1
     update idletasks ; # DoWarp is an idle callback
     after 50         ; # Win specific - wait for SendInput to be executed


B. Targeted at the result of test 35.1
======================================
All experimentation under B. has been carried out with tk-after-fix (checkin 3bf789ff).

1. after changing (in bind-35.1) the warp coordinates for the top left screen corner from (1,1) to (10,10) or (50,50), the test still doen't pass.

2. The change mentioned under A5. does NOT make the test pass.

3. The following change to the test script does NOT make the test pass:

--- bind.test.orig	2020-01-25 14:06:51.000000000 +0100
+++ bind.test.new	2020-02-01 11:08:56.000000000 +0100
@@ -6690,6 +6690,9 @@
 } -result {0 0 ok ok}
 
 test bind-35.1 {pointer warp with grab on master, bug [e3888d5820]} -setup {
+	set ::_twait 1
+	after 2000 {unset ::_twait}
+	vwait ::_twait
     event generate {} <Motion> -warp 1 -x 1 -y 1
     update idletasks ; # DoWarp is an idle callback
     after 50         ; # Win specific - wait for SendInput to be executed


4. Moving all (possibly interfering) toplevels out of the way that are not specifically used for test 35.1, with the following change, does NOT make the test pass:

--- bind.test.orig	2020-01-25 14:06:51.000000000 +0100
+++ bind.test.new	2020-02-01 11:16:04.000000000 +0100
@@ -6690,6 +6690,12 @@
 } -result {0 0 ok ok}
 
 test bind-35.1 {pointer warp with grab on master, bug [e3888d5820]} -setup {
+	# puts [winfo children .]
+	wm geometry .t +600+200
+	wm geometry . +600+400
+	set ::_twait 1
+	after 2000 {unset ::_twait}
+	vwait ::_twait
     event generate {} <Motion> -warp 1 -x 1 -y 1
     update idletasks ; # DoWarp is an idle callback
     after 50         ; # Win specific - wait for SendInput to be executed


5. Turning off the hot spot feature entirely (in the KDE system configuration under "Screen Edges") does NOT make the test pass.


6. The following change to the test script DOES make the test pass:

--- bind.test.orig	2020-01-25 14:06:51.000000000 +0100
+++ bind.test.new	2020-02-01 10:59:29.000000000 +0100
@@ -6703,7 +6703,7 @@
 } -body {
     grab .top  ; # this will queue events
     update
-    after 50
+    after 200
     update
     event generate .top.l <Motion> -warp 1 -x 10 -y 10
     update idletasks ; after 50

===//===

fvogel added on 2020-01-30 19:36:23:

Yes I'm seeing the hot spots on Debian 9 with KDE.

Ican see an interference of the upper-left hot spot corner with one test (bind-34.3) and have committed a patch a few days ago to let it succeed even when hot spots are present in the system. On this Debian 9 / KDE system the upper left hot spot area has size 1x1 pixel. See [21ee34adde] for details (I have put comments there explaining the story).

In bind-35.1 instead of moving the pointer to (0,0) I'm moving it to (50,50) now, which is enough to avoid entering a hot spot area while still moving the pointer out of the window (its top left corner is at (300,300)) as needed. This was my most recent commit ([4e5c19528b]) in the bugfix branch.


erikleunissen added on 2020-01-30 17:31:28:
Just a note regarding the process/progress:

While carrying out the effect of changing the warp coordinates from [1,1] to [10,10], I got an unexpected result, which may provide a different insight about the (interfering) role that the KDE hot spot plays in the test (or Tk test suite). I need ample time to formulate precisely, while doing some more experimentation / fleshing out at the same time.

One question for now though:

Did you also experience the interference of hitting the hot spot on KDE/Plasma? Perhaps with the Debian 9 system you mentioned (I don't know what desktops Debian provides).

fvogel added on 2020-01-27 21:13:36:

Thank you for your testing, I appreciate that.

So the results you're providing show that the fix works on all platforms you tried, except when hot spots are featured by the WM on some Linux flavours. Great!

I didn't mention it, sorry, but for me bind-35.1 now passes on all platforms I have access to (Debian 9, Win Vista, MacOS Catalina).

To improve robustness further, I have committed in [4e5c1952] a modified version of bind-35.1 that should avoid any interferences in bind-35.1 with potentially present hot spots. Could you please confirm?


erikleunissen added on 2020-01-27 10:44:04:
I tested the fix as follows:

Tk versions (compiled with the instrumented tkGrab.c):
- tk before fix, checkin 6c21b2cc
- tk after fix, checkin 3bf789ff

Platforms:
- Linux, KDE/Plasma
- Linux, IceWm
- Linux TWM
- Windows 7

For each combination of Tk version and platform, I ran (using tktest from the build directory):

    tktest(.exe) all.tcl -geometry +0+0 -file bind.test -match bind-35.1 -verbose beps > $outFile

Thus, each output file holds the test result as well as the corresponding printf() output of the instrumented tkGrab.c. Some lines have been prepended to this output for the purpose of system indentifcation.

Please see the results in "test_results_fix_3bf789ff.zip" attached.

Note that the test on KDE/Plasma fails where it should pass, most likely because the test triggers the hotspot in the top-left corner of the screen. Although I actually observed the follow-up screen action of this triggering only with 6c21b2cc (before fix), and not with 3bf789ff (after fix), I have no confidence in the results of both these test runs. The other window managers on Linux (IceWM and TWM) don't have such a hotspot and cannot exhibit such interference.

Also, I did a manual inspection of various Tk programs using grabs (on Linux and windows) and none displayed any peculiarities with tk-3bf789ff (i.e. including the fix).

erikleunissen added on 2020-01-25 16:56:44:
And a preliminary remark regarding:

"Reversing the order of statements in HandleEventGenerate() (see [3bf789ff]) lets TkPointerEvent() be called with the correct order of events and this changes the failure of bind-35.1 into a pass. This is true on all platforms, especially on the mac where the useless generation of an NSMouseMoved event when warping the mouse pointer can therefore be removed (see [2d8804d0])."

This seems to be a very good sign. Like a tent, folding neatly after having removed a useless dimension, spanned up by unnecessary awkwardness in the original code.

erikleunissen added on 2020-01-25 16:49:30:
Great, and respect! The analysis to finally get to a conclusion regarding the order of statements must have been quite a challenge, or comprised a lot of work at least.

I'm going to inspect the fix, as you requested, in the coming days.

fvogel added on 2020-01-25 14:33:26:

After careful analysis I committed the fix [3bf789ff] making bind-35.1 pass on both Windows and Linux.

The order of statements here matters after all. Scheduling an idle warping after Tk_HandleEvent() is called is OK as a principle, but the issue then is the detailed order of events that this order of code statements triggers. It appears that this order of events does not let TkPointerEvent() to set dispPtr->serverWinPtr (in a EnterNotify or LeaveNotify event) before the MotionNotify event will use that value to decide whether TkChangeEventWindow() will be called or not, which governs failure or success of bind-35.1.

Reversing the order of statements in HandleEventGenerate() (see [3bf789ff]) lets TkPointerEvent() be called with the correct order of events and this changes the failure of bind-35.1 into a pass. This is true on all platforms, especially on the mac where the useless generation of an NSMouseMoved event when warping the mouse pointer can therefore be removed (see [2d8804d0]).

To support my above analysis and fix, I'm attaching an instrumented tkGrab.c, and the corresponding debug output when running bind-35.1 before and after the fix.

Finally, the fix doesn't trigger any new failure (modulo the usual few semi-unstable tests) with the full test suite on any of the three platforms.

Could you please test the fix extensively with your complicated grabbing application and report your findings here? Thanks!


erikleunissen added on 2020-01-19 17:35:57:
I confirm that with the additional [update] in "inspect-multirun.tcl", placed as you indicated, the new bind-35.1 test (in [6c21b2cc]), now fails as expected on all occasions/platforms that I used for my previous inspections.

B.t.w. I am well aware of the very tentative and preliminary nature of the patch posted earlier, and it's corresponding value as a "fix".

Thanks for your clear judgement.

fvogel added on 2020-01-19 14:54:17:

Thanks for your test scripts and results (the attached 'test-script_inspection.zip'). I have analyzed all this matter and have reached the following conclusions:

  1. 'test-script_inspection.zip' with unpatched code
(specifically: at [fc2dd329]):
  • The expectation is that bind-35.1 always fails (since this is the unpatched codebase case), whatever the platform, whatever the number of consecutive wish launched, and whatever the number of bind-35.1 test runs inside the same wish.
  • test-fvogel.tcl: when bind-35.1 unexpectedly passes, which happens quite often, it's always at the first run of the test in a given wish session. Some wish sessions show their first run of bind-35.1 PASS, some others show all bind-35.1 FAIL as expected. It's not always the first wish session that shows the first bind-35.1 run unexpectedly pass, it can happen in a later wish session. Analysis of this issue reveals that your way of testing contains a race condition (let's call it race condition #1) in 'inspect-multirun.tcl': wish gets launched, and the test is run right after wish startup, without ensuring the event loop has mapped the '.' window, which is the mandatory condition for the first statement of bind-35.1, i.e. 'event generate {} <Motion> -warp 1 -x 1 -y 1' to really warp the pointer. The next executions of bind-35.1 in the same wish do not suffer from the same issue because the first run of bind-35.1 will run the event loop several times, thus servicing the queued events there as requested for warping to actually happen. Therefore an 'update' statement is missing in 'inspect-multirun.tcl' before the while loop. When this 'update' is added, the expected result described above (i.e. bind-35.1 always fails) is always obtained. Well, almost always! There are still very rare cases when the first run of bind-35.1 unexpectedly passes, and analysis of those cases shows that they pass because warping on the grabbed window unexpectedly works (i.e. (x1,y1) get the values we would expect when the test will PASS and the bug be fixed), which happens because the grab being set just before seems to have not yet been fully in effect. Again a race condition (race condition #2). Adding yet another 'update', in bind-35.1 this time, fixes this. This is [6c21b2cc]. Now I cannot make the test unexpectedly pass anymore, with either Windows or Linux.
  • test-fvogel-modified.tcl: now that the behavior with test-fvogel is clear, the results you observe with test-fvogel-modified.tcl can easily be explained. The added 'twait 1000' allows for the '.' window to be mapped, thus fixing race condition #1 above. Race condition #2 still exists but was not observed in your tests because it rarely happens.
  • test-elns.tcl: more or less the same explanation holds: the first race condition above is avoided thanks to extensive 'wm deiconify' and 'twait 1000' gymnastics, compensating for missing 'update' in the test setup 'inspect-multirun.tcl'.
  • On either Windows or Linux I never see the test unexpectedly pass when run by the bind.test test suite. Not even at [fc2dd329], that is without the 'update' added in [6c21b2cc]. Again, same reason regarding race condition #1: the '.' window was mapped a long time before the test suite reaches bind-35.1. Race condition #2 never happens for me in the test suite, as far as I could see it triggers only with your test setup, nevertheless I have committed (in [6c21b2cc]) the added 'update' in bind-35.1 for the sake of added test robustness.
  • So in a nutshell your testing setup with test-fvogel*.tcl triggered false PASSes because of incorrect testing setup. The '.' window must be mapped for the test to work correctly, which was not always the case due to race conditions in 'inspect-multirun.tcl'.

  1. Patched code:
  • Remember I didn't identify yet what the correct patch is. Therefore I was not yet interested at tests results with such (unknown) patch. I needed time to figure out why your testing setup 'test-script_inspection.zip' failed and mine never does. Now that this is clear and since we have a reliable bind-35.1 test script in the test suite I can again devote my time to actually fixing the bug. Just note that since it's not clear what the correct patch is, tests results at this stage with whatever "patch" we discussed before are simply irrelevant.


erikleunissen added on 2020-01-13 20:52:23:
I've done more experimenting with the latest test script bind-35.1.

My initial interest was to detect the effect of possibly varying initial conditions (e.g. position of the mouse pointer).

The experiment is governed by two scripts which have been attached. Essential to the experimental procedure is the consecutive execution of tests. This aspect of consecutive execution is dual in nature:
- as invocation of consecutive wish/tclsh executables (os processes), governed by "inspect-master"
- as consecutive tests within a single run of the test harness, governed by "inspect-multirun"

The script "inspect-master" is responsible for a reproducible test script invocation (twice in succession as configured now). The script "inspect-multirun" carries out the test (again multiple runs) inside the tcltest harness. "inspect-master" invokes sub-processes that in turn source "inspect-multi-run"

Attached you find the experimenting scripts and the test-scripts inspected by them.

Also attached, you find the output for several inspection runs on my systems. From these output files you can see that the latest test script in the bug-fix branch of Fossil (attached as "test-fvogel.tcl") still doesn't fail in all unpatched circumstances on my systems. Also, on my Linux system, 3 out of 4 times the test script doesn't pass for patched code.

I've added a modified test script (test-fvogel-modified) and its inspection results. On my linux system, this scripts fails in unpatched code where the unmodified script passes. However on my Windows 7 system, it doesn't perform better overall.Nevertheless, this might give you some clue about the reason for it not failing where it should (on Linux).

Also I've added a slightly adjusted copy of my original test script (test-elns.tcl), which behaves as expected, but which is still overly safe in allowing time for screen updates (but not as exceedingly as the original). I did this to have a double-check and reference case for actually having tested patched/unpatched code as intended.


Additional particulars
======================

I've seen an anomaly when warping the mouse pointer to the upper-left corner of the screen (0,0) or (1,1). This is specific to my Linux desktop KDE, which has a desktop widget fixed at that position. This widget wakes up as soon as the mouse pointer is over that position and deiconifies all windows of any program on the desktop, thus possibly spoiling test runs. Maybe it's wise to avoid that position (allowing a certain margin of inaccuracy). In fact, this remark concerns all corners and borders of the screen, since that's where desktops attach their specific functionality (docks, taskbar, ...).

=*=

erikleunissen added on 2020-01-13 20:51:58:
Although I understand that the two approaches that you mention are quite (entirely?) orthogonal w.r.t. each other, I don't (yet have the background to) understand the essential difference. For now I rely on your insight for this, and for the choice involved.

fvogel added on 2020-01-13 19:10:58:

I think that we should either tweak the grab machinery in case of pointer warping (your original fix proposal), or make each platform generate the required events that will let that machinery work unchanged.

It seems the latter was done on macOS. Indeed, things work on macOS because a specific event is being send on this platform by TkpWarpPointer(), which Windows does not generate. Removing this event from TkpWarpPointer() in tkMaxOSXMouseEvent.c makes the newly added test bind-35.1 fail on the mac.


fvogel added on 2020-01-11 14:21:45:

Test bind-35.1 now made more robust on Windows, see [98ff7da2].

In addition, [fc2dd329] makes bind-35.1 now fail on Linux and not only on Windows. Previously the warp prologue was in fact preventing the test from failing on Linux as expected at this stage (we didn't commit a fix).

So the test now reliably fails for me on both Windows Vista and Linux Debian 8 with KDE.

And it passes on macOS, also as expected since there is no issue we could see on macOS.

I'm running the test script the following (standard) way:

Linux (or macOS):

  make test TESTFLAGS="-file bind.test -match bind-35.1 -verbose beps"

Windows (with MSVC):

  nmake -f makefile.vc test TESTFLAGS="-file bind.test -match bind-35.1 -verbose beps"

If you still see instabilities in test results you will need to be much more specific in describing how you run the test.


erikleunissen added on 2020-01-09 18:02:12:
I see now that that the last line of my post at 2020-01-08 19:37:34 could be misunderstood. Where I wrote:

    "B.2 Subsequent runs when sourced from tkcon:"

I meant:

    "B.2 Consecutive runs when sourced from tkcon:"

erikleunissen added on 2020-01-08 21:25:31:
Ah! Thanks very much for the explanation. I overlooked the returning of functions before reaching the breakpoint.

Thanks again.

fvogel added on 2020-01-08 20:33:23:

The call stack shown at a breakpoint stop shows, well, the call stack, that is the list of nested function calls that brought the program at that breakpointed position. If the program calls a function foo() before that breakpointed position and that call returns the program flow goes on, the next instruction is called and the program finally reaches the breakpointed position. The back trace at this place does not show the call to foo() because this call returned already. The call stack is the list of nested calls that led to the breakpointed position in the code flow, it's not the list of all functions called in the past before reaching the breakpointed position.

Now imagine a platform-specific call was made and returns before the breakpointed position. This call could change, say, the value of some fields of dispPtr in a platform specific way. You cannot see this just from inspection of a call stack at a breakpointed position past the return point of the platform-specific function. That's why comparison of call stacks between platforms at a given position is pointless.

Example:

void funA(ptr) {
  funB(ptr);
  ...many other things...
  prt->field2 = 0x12344321; // set a breakpoint stop here
}
void funB(ptr) {
  ptr->field1 = 0xdeadbeef;
}
void main() {
  funA(ptr)
}

When the debugger stops at the breakpointed line, the simple inspection of the call stack will not show you the call to funB(). And the difference in how ptr->field1 may have been changed in a platform specific way by funB() (imagine funB() has a platform specific implementation) would have been missed.


erikleunissen added on 2020-01-08 19:37:34:
W.r.t. your analysis of TODO items A. and B.:

A. Only for my learning process / understanding:

You report a call to Tk_GetRootCoords() in between Tk_HandleEvent() and TkPointerEvent(). That call doesn't show up in the back trace that I generated (and you already mentioned previously that a back trace would be insufficient to detect that).

Is that because the call to TkPointerEvent() is a result of the scheduling as an idle callback (of DoWarp in Tk_HandleEvent)? 

(If it's not, then the real reason is probably involved to explain, and I don't mean to use this ticket machinery for the purpose of educating me.
If you do care to explain it to me and it's involved then you may consider throwing me a few hints by email)

B. OK, I'm happy that there is a valid explanation.

erikleunissen added on 2020-01-08 18:53:20:
While experimenting around, I noticed that the test script (bind-35.1) doesn't always fail on various Tk installations that have the issue (i.e. without the proposed patch). 

A. Windows 7
============
When source'ing several runs in sequence from tkcon (for both the following two installations):
- 8.6.9 ActiveState (unpatched)
- 8.6.10 Mingw (unpatched)

F(ail) and P(ass), but I think I found a pattern:
F if . (root window) is mapped at start of the test script execution
P if . is not mapped at start of the test script execution


B. Linux KDE plasma
===================

B.1 When executing an executable script:
8.5.19 (unpatched)	always P
8.6.10 (unpatched)	always P

B.2 Subsequent runs when sourced from tkcon: sometimes Pass sometimes Fail, no relation found with root window being mapped as under Windows 7.

fvogel added on 2020-01-08 07:43:15:
A. Running step by step from HandleEventGenerate() and looking for any platform-specific function called at any point until TkPointerEvent() gets called:

I did it on Windows and have found that indeed Tk_GetRootCoords gets called before that point, and it's a platform-specific function. However it is used to find the x,y root coordinates from the given -x,-y so it is unlinked to the issue. Apart from that one I didn't see a platform-specific function being called, which is what I expected, but I wanted to be sure.

B. Influence of the warp prologue

On Windows, I have checked now that with or without warp prologue, in TkPointerEvent we have dispPtr->serverWinPtr = 0x0 (NULL). This is consistent with the already discovered fact the warpPrologue has no effect on the behavior on Win.

I'm now thinking that the platform differences we see are not due to platform-specific code being called before the point of interest, but rather due to differences in events being sent by the WM depending on the platform. dispPtr->serverWinPtr is only set in TkPointerEvent, in case Tk receives an <Enter> or <Leave> event for a window. Apparently Windows does not send the same events as Linux, otherwise we would observe the same NULL/NON NULL value of dispPtr->serverWinPtr on both platforms.

Platform-specific differences in events satisfies my need for understanding why platforms behave differently.

We could perhaps instrument TkPointerEvent and compare traces between platforms to be sure.

C. Code order in HandleEventGenerate

Later.

fvogel added on 2020-01-06 21:41:05:
About the items on our todo list (A. B. and C.), I'll look at this later. Just a note however regarding A.: comparing backtraces at a specific breakpointed position is not good enough. It does not mean that platform-specific code was not executed before that point, and this could make the difference. We need to run step by step from HandleEventGenerate() and look for any platform-specific function called at any point until TkPointerEvent() gets called. I can do this easily, on either Windows or Linux or macOS.

So I will dive into all three A., B. (you show interesting things that could perhaps lead to the right fix) and C.

fvogel added on 2020-01-06 21:33:13:

I have now added a modified version of the test case you have proposed. See bind-35.1 in [200baaa88b]. This test is currently failing as expected (to demonstrate the issue).

This test passes with our proposed fix (which I don't commit yet because of the analysis in progress about the root cause of the issue). It should also pass with the "correct" fix (which is yet to be determined).


fvogel added on 2020-01-06 20:40:18:

I have opened branch bug-e3888d5820 to collect tentative fixes and other attempts related to this ticket.

First commit in that branch is the reordering of the Tcl_DoWhenIdle(DoWarp, dispPtr); snippet. We can do it, it does not change anything in how things work and it adds clarity in the source code. However I didn't add more comments there: the call to function Tcl_DoWhenIdle() is already very clear.

More to follow on the rest of your posts.


erikleunissen added on 2020-01-06 19:08:19:
Analysis of program flow w.r.t. the effect of the warp prologue on Linux.

The effect occurs in tkGrab.c at line 795 and following. Please see there for the variables mentioned below.

A. Without warp prologue
========================
outsideGrabTree = 0
dispPtr->serverWinPtr = 0x0 (NULL)

This leads to:
- winPtr2 being set to the grab window
- (winPtr2 != winPtr) being TRUE and
- TkChangeEventWindow() being CALLED.


B. With warp prologue (pointer warp is insensitive to issue on linux)
=====================================================================
outsideGrabTree = 0
dispPtr->serverWinPtr = 0x22b9920

This leads to:
- winPtr2 NOT being set to the grab window
- (winPtr2 != winPtr) being NOT true and
- TkChangeEventWindow() being SKIPPED
(and the program behaves as if there were no grab).


Conclusion
==========
Case B. is insensitive to the issue (on linux) because TkChangeEventWindow() is not being called, but for a reason that takes effect past line 777 in tkGrab.c.
Anything that takes place after that line in case of a pointer warp, at whatever platform, is entirely irrelevant to the issue of this ticket.

So, case B. is a red herring, but good to have checked it.

=*=

erikleunissen added on 2020-01-06 16:00:10:
I added a (preliminary) test script that fails if the issue shows up.

Two points:
1. the test script adds accuracy in that it ensures that the initial position of  the mouse pointer always is outside the toplevel window that we use for testing. This is necessary to prevent a possible interference with what we see with the warp prologue on Linux.

2. I'm unsure abut the desired accuracy for the test result (please see the comment in the script).

erikleunissen added on 2020-01-06 15:18:05:
Some updates:

- On my linux system, tk8.6.10 is back up its feet[*]. From now onwards I will be working with 8.6.10 again, so we are synchronized in that respect. Now I can confirm that your fix against 8.6.10 resolves the issue on Linux too.

	[*] There was something wrong with my installed combination of Tcl8.6.10 and Tk8.6.10. I rebuilt the two of them in one swoop (with the same configuration) as I normally do, installed them anew and the problem of the segfaults was gone.

- Your explanation of the course of program flow corresponds exactly to my experience when I used gdb to trace back to the origin of the issue in TkPointerEvent(), see also the back trace under A below.

- My manual testing was targetted only at grabs and mouse behaviour, and while I checked the handling of the GUI I didn't notice anything odd otherwise. So it's an indicator yes, but not a structured inspection for regressions.

- W.r.t. the placement of this code fragment:

	if (!(dispPtr->flags & TK_DISPLAY_IN_WARP)) {
	    Tcl_DoWhenIdle(DoWarp, dispPtr);
	    dispPtr->flags |= TK_DISPLAY_IN_WARP;
	}

I do understand the scheduling nature of it, and also that it doesn't make a difference for the program's execution path further on. However, it does make a difference for code readability and maintainability. As soon as one starts wondering why some statements were issued after this code fragment, there is a lack of maintainability, especially for those less experienced with the event loop. Only after you read the lines in the context of the code fragment, you understand that the order doesn't make a difference for the pointer warp. But in my opinion you  shouldn't have to do that. For that reason only, I would move the code fragment to the position as in my patch for 8.6.9, and I also would at the very least document the scheduling nature of the statements with a short comment.

I'll let you decide, and I can accept any choice.

- I suggest we handle your points 7a, b and c as a TODO list. Please see below:

==========================

A. "We should see the issue on the mac if the problem is on the generic code, or we should understand why the mac does not exhibit the problem + [So far only generic code was called I think (TODO: confirm this firmly).]"

I suggest to:
- generate back traces leading up to TkPointerEvent() in an unpatched program for all 3 platforms. (set a breakpoint on line 777 in tkGrab.c and do: bt).
- compare them for differences
- check the C functions and C source files for each of them
- decide about generic code and possibly different pathways for different platforms.

I added the back trace for Linux as "bt-linux-8.6.10-unpatched.txt". As you can see from it, all invoked C functions are in source files that reside in ./generic except for the program entry at main and Tk_MainEx(), which both reside in source fils in ./unix.

As for Windows: could you supply that? My Windows installation is not prepared for this kind of debugging, and if I were to manage to do that at all, I know from my experience some years ago it'll take me half a day to do so.

And who can for MacOSX? (note that the path of execution may not reach line 777 in tkGrab.c at all, depending on before or after Marc's changes in the grab area).


B. "We need to understand the story with this warpPrologue. This is very obscure to me, perhaps I should see this with my own eyes on Linux." 

It's remarkable and miraculous indeed. I will investigate the code path for this on Linux.

Do you have a Linux box yourself? Otherwise I could see whether I can produce a video by recording my desktop for you.


C. We should better understand the consequences/reasons for the order of the code in HandleEventGenerate()

This requires a better understanding of the implementation of the event loop at the C level, than I have now. That could prove to take me quite some time to study.

Therefore, I will continue with B ...
=*=

fvogel added on 2020-01-05 20:53:36:

I will try to answer each point in turn here:

1. and 4. Thanks for the clearing. For sure the crash deserves a separate ticket. You seem to say it's fully repeatable, please post the commands you use to configure and build, your test script, the action you take with this script running (if any), and a backtrace with the ticket. First cause that comes to mind is unfortunately TIP #532. If you can debug this, it's the very best because nobody has a crash of vanilla 8.6.10 when running the test suite to the best of my knowledge.

2. I can test on the mac (many thanks to Brad, once more!), here are the results on Catalina with the current core-8-6-branch: your "exercise.tcl" test script warps the mouse at x=10,y=10 in the yellow label in any case (that is, whatever the value of doGrab and warpPrologue are). So there is no issue on the mac.

3. Yes I can see the issue under Windows (Vista at least): your "exercise.tcl" test script warps the mouse at x=10,y=10 in the yellow label if doWarp is 0 but not if it is 1 (that is, whatever the value of warpPrologue is). Since then you also saw the problem on Windows it so we're consistent here.

3a. I understand you see the issue under Linux. I didn't try on this platform so far. What is apparently specific to Linux is the effect of the warp prologue !?

3b. Since it works without any code change on the mac then the logical conclusion is that the generic code is correct. This is puzzling at first, but I think it can be attributed to the more or less recent fixes by Marc Culler in the grab area for the mac. This assertion will need confirmation though, by fossil updating to before Marc's changes and testing your script on the mac (expected result: warping in the yellow label should fail when doWarp is 1).

3c. Be it on Windows or on the mac I see the same behavior whatever the warpPrologue value is.

5. About the fix itself, here is what is happening and why the fix works:

  • event generate $w <Motion> -warp... schedules mouse pointer warping at idle time. It does so in TkHandleEventGenerate, after setting the warpWindow, and before setting the x and y coordinates to where the pointer should warp. It works even if x and y are set after the scheduling because, precisely, this is a scheduling: warping will happen at idle time and at this time x and y will have been set by TkHandleEventGenerate. Moving the Tcl_DoWhenIdle(DoWarp, dispPtr); code block down is more logically intuitive but it does not matter if it's placed before setting of dispPtr->warpX/dispPtr->warpY. (this move does not have an effect IMO)
  • In TkHandleEventGenerate, the warpWindow is not the right window in which the user asked to warp but the window in which the events are redirected by the grab implementation. This redirection happens in TkChangeEventWindow, which is called by the call to Tk_HandleEvent a few lines above the idle warping scheduling. Tk_HandleEvent triggers the mouse event handlers (InvokeMouseHandlers) which call TkPointerEvent, which arranges for grab to work by deciding whether the event should be processed or not (to make grabs work as expected).
  • So bailing out of TkPointerEvent in case a warp is scheduled (i.e. if (dispPtr->flags & TK_DISPLAY_IN_WARP)) would work to allow the warping in the correct window because TkChangeEventWindow would not be called. However to achieve this we need to reorder the code in TkHandleEventGenerate so that the TK_DISPLAY_IN_WARP is already set when Tk_HandleEvent is called.
  • So far only generic code was called I think (TODO: confirm this firmly).
  • As a general matter, our both fixes may work but they may also be an incorrect fix. I'm not yet convinced this fix is correct.

5a. From this code analysis, the unmodified code should work if a -when option is added to the event generate:

  event generate .t.l <Motion> -warp 1 -x 10 -y 10 -when tail
And indeed it is the case (checked on Windows).

6. The test suite not showing new failures is a good indicator. On both Windows Vista and the mac Catalina I see zero failures. And your testing of your application with heavy grabs is another indicator of the correctness of the fix. Did this testing test for regressions as well? Any new behavior other than the expected change?

7. All that said, I'm uncomfortable because I don't think we grok the full story here.

a. We should see the issue on the mac if the problem is on the generic code, or we should understand why the mac does not exhibit the problem.
b. We need to understand the story with this warpPrologue. This is very obscure to me, perhaps I should see this with my own eyes on Linux.
c. We should better understand the consequences/reasons for the order of the code in TkHandleEventGenerate


erikleunissen added on 2020-01-05 16:46:56:
Also under Windows 7 no issues when manually testing local grabs in a rather involved GUI with the dll containing my fix.

erikleunissen added on 2020-01-05 16:33:39:
W.r.t.

3. About Windows: did you observe the issue under Windows? I didn't when running exercise.tcl on Windows 7 at the time when I filed this ticket. So I didn't bother to test a fix on that platform.

This is wrong. I seem to have trouble correctly remembering my own experiences of a few weeks ago; sorry. The issue did exist on Windows[*].

Therefore, I also tested my fix to 8.6.9. on Windows 7, and it solved the issue there too.
--
[*] Only the effect of a pointer warp prologue into the master widget - see my first original post - didn't show.

erikleunissen added on 2020-01-05 15:22:08:
And w.r.t. other tickets: when I filed this ticket, I already checked for similar tickets (search for the combination of warp and grab). I found none.

erikleunissen added on 2020-01-05 15:14:54:
The fact that my builds of Tk8.6.10 are irreliable is an awkward coincedence: I cannot directly test your fix against 8.6.10. Also, your patch is rejected when applied to tkBind.c in 8.6.9. (because that file has been slightly changed between 8.6.9 and 8.6.10.)

I did test my own patch against 8.6.9. Results:
- test suite: tests in test files grab.test and event.test ran without any failures.
- I manually inspected the behaviour of (only local) grabs in a rather involved GUI and found no issues.

erikleunissen added on 2020-01-05 13:47:10:
I contentedly agree with your remarks. 

1. This ticket has accumulated some noise for which I'm to blame. I'm trying to clear things up for you, but if it remains unclear, we should consider opening a new ticket and fill it with the parts of this ticket that remained relevant over the course of its development.

2. About Mac: though I have a macosx, I cannot test or inspect on this platform (I never got Tk running on it and I don't have the toolchain ready to build for it).

3. About Windows: did you observe the issue under Windows? I didn't when running exercise.tcl on Windows 7 at the time when I filed this ticket. So I didn't bother to test a fix on that platform.

4. About the segfaults: these are definitely unrelated to the issue of this ticket. It's my Tk8.6.10 that consistently crashes on Linux even when running the test suite after a pristine build. For further inspection of this ticket I reverted to Tk 8.6.9, which behaves reliably.

5. About the tentative fix that you posted:
- you were just one hour ahead of me. I attached the tentative fix that I produced in the meantime, and you'll see that it is almost identical to yours.
- I had the same reservations about the reordering of statements in tkBind.c. simply because I didn't understand the original order, especially w.r.t. the following code section:

	if (synch) {
	    Tk_HandleEvent(&event.general);
	} else {
	    Tk_QueueWindowEvent(&event.general, pos);
	}

In my fix, I went one step further than you did: if we're going to change the order of statements in tkBind.c anyway, then we could also question the location of the following code section relative to its context. I find that location confusing, but it doesn't directly relate to the issue of this ticket:

	if (!(dispPtr->flags & TK_DISPLAY_IN_WARP)) {
	    Tcl_DoWhenIdle(DoWarp, dispPtr);
	    dispPtr->flags |= TK_DISPLAY_IN_WARP;
	}

You can see the more logical location (as far as I'm concerned) in my patch, attached.

- I ran the Tk test suite under Linux with my fix, and it didn't produce more test failures than without the fix. That doesn't offer guarantees though, because the test suite exhibits more than 100 test failures. And even if it it had zero test failures, I'm unsure whether the test suite is complete w.r.t. grabs. What I can do additionally is to manually inspect the behaviour of a run a complicated GUI with grabs.

I'm very contented with our almost identical understanding of the issue so far.

fvogel added on 2020-01-05 11:41:58:

The following patch fixes the reported issue I think:

Index: generic/tkBind.c
==================================================================
--- generic/tkBind.c
+++ generic/tkBind.c
@@ -4351,21 +4351,10 @@
 	     */
 	    event.virtual.user_data = userDataObj;
 	    Tcl_IncrRefCount(userDataObj);
 	}
 
-	/*
-	 * Now we have constructed the event, inject it into the event handling
-	 * code.
-	 */
-
-	if (synch) {
-	    Tk_HandleEvent(&event.general);
-	} else {
-	    Tk_QueueWindowEvent(&event.general, pos);
-	}
-
 	/*
 	 * We only allow warping if the window is mapped.
 	 */
 
 	if (warp && Tk_IsMapped(tkwin)) {
@@ -4389,10 +4378,21 @@
 	    }
 	    dispPtr->warpMainwin = mainWin;
 	    dispPtr->warpX = event.general.xmotion.x;
 	    dispPtr->warpY = event.general.xmotion.y;
 	}
+
+	/*
+	 * Now we have constructed the event, inject it into the event handling
+	 * code.
+	 */
+
+	if (synch) {
+	    Tk_HandleEvent(&event.general);
+	} else {
+	    Tk_QueueWindowEvent(&event.general, pos);
+	}
     }
 
     Tcl_ResetResult(interp);
     return TCL_OK;
 }

Index: generic/tkGrab.c
==================================================================
--- generic/tkGrab.c
+++ generic/tkGrab.c
@@ -772,11 +772,11 @@
 	    }
 	}
 	return 1;
     }
 
-    if (!appGrabbed) {
+    if (!appGrabbed || (dispPtr->flags & TK_DISPLAY_IN_WARP)) {
 	return 1;
     }
 
     if (eventPtr->type == MotionNotify) {
 	/*

Warning: I didn't check this patch doesn't break the grab command, despite the test suite keeping being happy (on Windows at least). There must have been a reason for scheduling/dealing with the event _before_ scheduling the warping in HandleEventGenerate(). So this needs more work.

Also I'm a bit lost at the reported differences between Linux and Windows (and what about the mac?), and the crash story.

Finally, there are some open tickets on 'grab', does any of them ring a bell for you?


erikleunissen added on 2020-01-04 17:02:53:
Please disregard previous observation 2. The segfault is probably due to something else because now my wish8.6 always crashes, even if I only exit the app. Need to resolve that first here. Sorry for the noise.

Restoring severity and title ...

erikleunissen added on 2020-01-04 16:04:03:
More observations:

1. The file exercise.tcl sets the grab on "." instead of the master window ".t". This was confusing. It still demonstrated the issue, but it made the pointer get stuck  outside toplevel .t.
The file exercise.tcl has now been adjusted to set the grab on .t, and the result is that the mouse pointer gets stuck in the top left corner of the master window.

2. What I observed only now is that not only the mouse pointer gets stuck, but that the entire application "hangs". In order to terminate it, I have to send the process a signal from external. When adding an extra "exit" command to the script in exercise.tcl, the application segfaults. Quite some cherry on top of the already reported misbehaviour.
A full back trace of the dumped core file is attached. But if my analysis and suggestion for a fix in the previous post is correct, we won't need it because the segfault happens somewhere much later in the code.

The new file exercise is also attached, and I changed the title and the severity to reflect that we now also have a segfault.

erikleunissen added on 2020-01-04 11:45:53:
I've tracked this issue down to the location from where it originates, please see the code section below. This code section appears to not take into account the case that motion events originate from pointer warping.

I'm not yet entirely sure about the exact inappropriateness, but as far as my understanding reaches now, I believe that grabs are irrelevant to pointer warping, and they should be entirely ignored. If that's correct, then the fix ought to be applied to line 777, something like:

 777     if (!appGrabbed || motionEventOriginatesFromWarp) {
 778         return 1;
 779     }

B.t.w., for now the "motionEventOriginatesFromWarp" is an identifier that represents my logic, but for which I don't know how to implement it. 

If this approach is too rigourous, then maybe instead, a fix should be applied after line 781.

Erik.

=== Relevant code section In TkPointerEvent() in tkGrab.c, Tk 8.6.10 ===

 777     if (!appGrabbed) {
 778         return 1;
 779     }
 780 
 781     if (eventPtr->type == MotionNotify) {
 782         /*
 783          * When grabs are active, X reports motion events relative to the
 784          * window under the pointer. Instead, it should report the events
 785          * relative to the window the button went down in, if there is a
 786          * button down. Otherwise, if the pointer window is outside the
 787          * subtree of the grab window, the events should be reported relative
 788          * to the grab window. Otherwise, the event should be reported to the
 789          * pointer window.
 790          */
 791 
 792         winPtr2 = winPtr;
 793         if (dispPtr->buttonWinPtr != NULL) {
 794             winPtr2 = dispPtr->buttonWinPtr;
 795         } else if (outsideGrabTree || (dispPtr->serverWinPtr == NULL)) {
 796             winPtr2 = dispPtr->grabWinPtr;
 797         }
 798         if (winPtr2 != winPtr) {
 799             TkChangeEventWindow(eventPtr, winPtr2);
 800             Tk_QueueWindowEvent(eventPtr, TCL_QUEUE_HEAD);
 801             return 0;
 802         }

=== end of code section ===

erikleunissen added on 2019-11-10 12:48:35:
Just verified that the behaviour is identical for various window managers under Linux: KDE Plasma, IceWM and TWM.

Attachments: