Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence in which an agent uses a neural network to learn which actions to take in a given environment. DRL has recently gained traction by solving complex environments such as driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games. Numerous implementations of the state-of-the-art algorithms responsible for training these agents, such as the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist. However, prior studies have assumed implementations of the same algorithm to be interchangeable. In this paper, through a differential testing lens, we present the results of studying the extent of implementation discrepancies, their effect on the implementations' performance, and their impact on the conclusions of prior studies. Our differential tests revealed significant discrepancies between the tested algorithm implementations. In particular, out of the five PPO implementations tested on 56 games, three achieved superhuman performance in 50% of their total trials, while the other two achieved superhuman performance in fewer than 15% of their total trials. Furthermore, the performance of the high-performing PPO implementations was found to differ significantly in nine games. Through a meticulous manual analysis of the implementations' source code, we determined that code-level inconsistencies were the primary cause of these discrepancies. Lastly, we replicated a prior study and showed that these implementation discrepancies were sufficient to flip experiment outcomes when left unaccounted for. These findings call for a shift in how DRL algorithm implementations are used.
In addition, we encourage DRL library developers to mitigate such discrepancies by either adopting the differential testing methodology proposed in this paper or explicitly documenting code-level inconsistencies.