I'm trying to implement this algorithm:
Fitted Q Iteration
I have two questions:

Collect D samples: do I let chance decide which action A_i the agent takes, i.e. should the actions be picked at random?

How do I implement the argmin over theta? Do I need to use gradient descent to get it? TensorFlow offers tf.argmin(), but that just returns the index of the smallest value in a tensor, which doesn't make sense here. So I assume I have to work out theta' myself?
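
For reference, here is a minimal sketch of both steps. It rests on assumptions that are not in the original post: a hypothetical 5-state chain environment (`step`), one-hot features, a linear Q-function, and NumPy instead of TensorFlow. The data set D is collected with uniformly random states and actions, and the argmin over theta' is treated as a regression problem solved by plain gradient descent on the squared error. That is one common choice, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9  # hypothetical toy problem


def step(s, a):
    """Hypothetical chain: action 1 moves right, action 0 moves left.
    Reward is 1 whenever the next state is the last state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return r, s_next


def features(s, a):
    """One-hot feature per (state, action), so Q_theta(s, a) = theta @ features(s, a)."""
    phi = np.zeros(N_STATES * N_ACTIONS)
    phi[s * N_ACTIONS + a] = 1.0
    return phi


# Step 1: collect D = {(s, a, r, s')} with uniformly random states and actions.
D = []
for _ in range(500):
    s = int(rng.integers(N_STATES))
    a = int(rng.integers(N_ACTIONS))
    r, s_next = step(s, a)
    D.append((s, a, r, s_next))

Phi = np.array([features(s, a) for s, a, _, _ in D])          # (N, d)
R = np.array([r for _, _, r, _ in D])                         # (N,)
Phi_next = np.array([[features(s2, b) for b in range(N_ACTIONS)]
                     for _, _, _, s2 in D])                   # (N, A, d)

theta = np.zeros(N_STATES * N_ACTIONS)
for _ in range(50):  # fitted Q iterations
    # Regression targets, computed with the current theta held fixed.
    y = R + GAMMA * (Phi_next @ theta).max(axis=1)

    # Step 2: theta' = argmin of the mean squared error, found by gradient descent.
    theta_new = theta.copy()
    for _ in range(200):
        grad = Phi.T @ (Phi @ theta_new - y) / len(D)
        theta_new -= 1.0 * grad  # learning rate chosen for this toy problem
    theta = theta_new

# Greedy action per state under the fitted Q-function.
greedy = theta.reshape(N_STATES, N_ACTIONS).argmax(axis=1)
print(greedy)
```

In TensorFlow, the inner loop would correspond to a few optimizer steps on the same squared-error loss. `tf.argmin` plays no role there, since it only returns the index of the smallest entry of a tensor.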