The client adjusts the simulation frame by ping, so inputs produced by two clients at the same real-world time will be for the same simulation frame (each input sent carries the simulation frame stamp).
Is there anything else we should consider, such as automatically adjusting the input delay frames based on network conditions? Or do you think that’s not needed when doing the simulation frame adjustments described above?
Adjusting the simulation frame by ping improves the situation, but it can’t fix ping. If the time required for an input to travel from one client to the other is greater than the input delay, then there’s still going to be prediction.
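To put rough numbers on that, here is a minimal sketch in Python. The function name, the 60 Hz frame length, and the assumption that both clients' frame clocks are already aligned are mine, not something from your setup.

```python
import math

def predicted_frames(one_way_ms: float, input_delay_frames: int,
                     frame_ms: float = 1000 / 60) -> int:
    """Frames the receiver has to predict before the remote input arrives.

    Assumes both clients are on the same simulation frame at the same
    real-world time (the ping-based adjustment described above).
    """
    # Whole frames that pass while the input is in flight.
    travel_frames = math.ceil(one_way_ms / frame_ms)
    # The input delay absorbs some of them; the rest must be predicted.
    return max(0, travel_frames - input_delay_frames)
```

With a delay of 2 frames, a 30 ms one-way trip is fully covered, while an 80 ms trip still leaves a few frames of prediction (and possible rollback).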
As described in the previous thread, the decision to automatically adjust the delay based on ping is a tough one. Some fighting games have done it, but that led to player frustration, because input timing is critical in those games, and for some players it’s better to have a long but predictable delay than a possibly shorter but unpredictable one. I recall one game that simply gave players the option to enable variable delay, so it was up to the user to decide.
There’s another caveat with variable delay, and that’s growing and shrinking the input stream. In normal circumstances a player produces a steady stream of inputs, one per frame. However, when you increase the delay, you’re effectively injecting an input into the stream. For example, say it’s frame 10, the delay is 2, and I’ve just produced the input for frame 12. If in this very frame we bump the delay to 3, then next frame (11) I’ll be producing an input for frame 14. That means we have to “conjure” an input for frame 13. You could do something as simple as copying the input from frame 12, but whether this is a viable option or not depends on the game. If that input contained a “shoot” action, the player could shoot twice despite pressing the shoot button only once. A similar problem arises in the opposite direction. If we reduce the delay from 3 back to 2, on frame 12 we would again produce an input for frame 14. Should we in this case simply ignore the second input? Merge it with the previous one? Use only the new one? Tough questions.
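Here is one possible way to handle both cases, as a sketch only. The `Input` and `InputScheduler` names are made up for illustration, and the policy (repeat held state but not one-shot actions when filling a gap, keep the newest held state and OR the one-shot actions when two inputs collide) is just one of the options above. Whether it fits depends on the game.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Input:
    move_x: int = 0       # held state: safe to repeat
    shoot: bool = False   # one-shot action: must not be duplicated

class InputScheduler:
    """Maps locally produced inputs to target frames under a changing delay."""

    def __init__(self, delay: int):
        self.delay = delay
        self.scheduled = {}     # target frame -> Input
        self.last_frame = None  # highest frame that already has an input

    def submit(self, current_frame: int, inp: Input) -> None:
        target = current_frame + self.delay
        if self.last_frame is not None:
            # Delay grew: conjure inputs for the skipped frames by repeating
            # the held state only, so "shoot" cannot fire twice.
            prev = self.scheduled[self.last_frame]
            for frame in range(self.last_frame + 1, target):
                self.scheduled[frame] = Input(move_x=prev.move_x, shoot=False)
        if target in self.scheduled:
            # Delay shrank: two inputs land on one frame. Keep the newest
            # held state and preserve any one-shot press from either input.
            old = self.scheduled[target]
            inp = Input(move_x=inp.move_x, shoot=old.shoot or inp.shoot)
        self.scheduled[target] = inp
        if self.last_frame is None or target > self.last_frame:
            self.last_frame = target
```

Running the example from above through it: submitting on frame 10 with delay 2 fills frame 12, bumping the delay to 3 and submitting on frame 11 fills frame 14 and conjures frame 13 without the shoot flag, and dropping back to 2 on frame 12 merges into the existing frame 14 input.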
All in all, I’d say this is more of a design decision than a technical one. Keep in mind that the visuals can be detached from the underlying game logic, which means that rollback and resimulation do not have to result in a “teleport” - in case of small errors you could smoothly correct the visuals over time, similarly to how it’s done in FPS reconciliation. If you smooth out the resimulation, and solve the variable input stream problem from above, then variable delay is likely to give the best visual results.
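A minimal sketch of that kind of visual smoothing, assuming a single 1D position and an exponential decay of the error. The class name and the `correction_rate` value are placeholders I picked, not a recommendation.

```python
import math

class SmoothedPosition:
    """Rendered position that eases toward the simulated one after a rollback."""

    def __init__(self, x: float, correction_rate: float = 10.0):
        self.sim_x = x          # authoritative value from the game logic
        self.offset = 0.0       # visual error still to be blended out
        self.correction_rate = correction_rate  # higher = faster correction

    @property
    def rendered_x(self) -> float:
        return self.sim_x + self.offset

    def on_resimulated(self, new_sim_x: float) -> None:
        # Keep drawing where we were; store the jump as an offset instead.
        rendered = self.rendered_x
        self.sim_x = new_sim_x
        self.offset = rendered - new_sim_x

    def advance(self, new_sim_x: float, dt: float) -> None:
        # Normal frame: follow the simulation and decay the leftover offset.
        self.sim_x = new_sim_x
        self.offset *= math.exp(-self.correction_rate * dt)
```

The game logic snaps to the corrected state immediately, and only the drawn position lags behind for a few frames. For large errors you would probably still want to snap, which this sketch does not handle.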
If, however, this brings too many complications, then I’d probably do something in between, which is to measure the ping at the start of the session and set the delay based on that ping for the whole duration of the session. It’s rather rare for the ping to increase for longer periods. Just make sure that the ping is measured over “some” period of time (not just one sample) and that any spikes are discarded. Also remember to set the delay for both players based on the sum of their one-way latencies - after all, the time for an input to reach one or the other depends on the pings of both of them.
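Roughly, that could look like the sketch below. It assumes each client's ping is a round trip to a relay or server, that the route is symmetric (one way = half the round trip), and a 60 Hz simulation. The function names, the spike threshold, and the delay cap are arbitrary values chosen for illustration.

```python
import math
import statistics

def stable_rtt_ms(samples, spike_factor: float = 2.0) -> float:
    """Average round-trip time over several samples, ignoring spikes."""
    median = statistics.median(samples)
    # Treat anything far above the median as a spike and drop it.
    kept = [s for s in samples if s <= median * spike_factor]
    return statistics.mean(kept)

def session_input_delay(rtt_a_ms: float, rtt_b_ms: float,
                        frame_ms: float = 1000 / 60,
                        max_delay: int = 8) -> int:
    """Fixed input delay (in frames) shared by both players for the session."""
    # An input travels A -> relay -> B, so add both one-way latencies.
    one_way_ms = rtt_a_ms / 2 + rtt_b_ms / 2
    frames = math.ceil(one_way_ms / frame_ms)
    # Clamp so a terrible connection doesn't make the game unplayable;
    # anything beyond the cap is left to prediction and rollback.
    return min(max_delay, max(1, frames))
```

For example, samples of 40, 42, 38, 300 and 41 ms average out to about 40 ms once the 300 ms spike is dropped, and two players with 40 ms and 50 ms round trips would end up with a 3-frame delay.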
Hope this helps!