A while ago I ran into a strange memory leak in a fairly complex use of the OpenCL C++ wrapper class cl::Event.
The code looked like this (pseudocode):

Code :
cl::Event current_event;
std::vector<cl::Event> prerequisites;
// Some upload events happen and fill prerequisites...
command_queue.enqueueNDRangeKernel(kernel1, ..., &prerequisites, &current_event);
prerequisites.clear();
prerequisites.push_back(current_event);
command_queue.enqueueNDRangeKernel(kernel2, ..., &prerequisites, &current_event);
prerequisites.clear();
prerequisites.push_back(current_event);
command_queue.enqueueNDRangeKernel(kernel3, ..., NULL, &current_event);
prerequisites.push_back(current_event);
command_queue.enqueueNDRangeKernel(kernel4, ..., &prerequisites, &current_event);
prerequisites.clear();
current_event.wait();
// Download data from device and finish...

To my mind, this is a sensible way to express the following dependency graph in C++:

Code :
upload -> kernel1 -> kernel2 } -> kernel4 -> download
                     kernel3 }

This leaks horribly. It looks like it should work, but the C++ cl::Event is a very thin wrapper around the equivalent C handle. As far as I can tell, the wrapper keeps no reference count of its own, so when current_event is reused while its previous event has not yet happened on the GPU, the attempt to destroy that previous event does nothing, and it cannot report an error either. The copies of the cl::Event stored in prerequisites still work fine, because that first attempt to destroy the event in the driver fails. The second attempt, when prerequisites is cleared, fails as well: the queue is still being built and the event still has not happened on the GPU. The upshot is that the queue is fine and the code produces correct results, but the events pile up in the driver and slow everything to a crawl.
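The exact mechanism will depend on the cl.hpp version and the driver, so the following is only a toy model of one way a thin wrapper can leak when a single event object is reused. It assumes that the enqueue call writes the new raw handle straight into the wrapper, overwriting the old handle without releasing it. ToyEvent, toy_enqueue and the raw_* functions are hypothetical stand-ins, not part of the OpenCL API.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for the driver: counts handles created but not yet freed.
static int g_live_handles = 0;

struct RawEvent { int refs; };

RawEvent* raw_create() { ++g_live_handles; return new RawEvent{1}; }
void raw_retain(RawEvent* e) { ++e->refs; }
void raw_release(RawEvent* e) {
    if (--e->refs == 0) { --g_live_handles; delete e; }
}

// Hypothetical thin wrapper (NOT the real cl::Event): retains on copy,
// releases on destruction, and exposes its raw handle slot.
class ToyEvent {
public:
    ToyEvent() : handle_(nullptr) {}
    ToyEvent(const ToyEvent& o) : handle_(o.handle_) {
        if (handle_) raw_retain(handle_);
    }
    ToyEvent& operator=(const ToyEvent& o) {
        if (o.handle_) raw_retain(o.handle_);
        if (handle_) raw_release(handle_);
        handle_ = o.handle_;
        return *this;
    }
    ~ToyEvent() { if (handle_) raw_release(handle_); }
    RawEvent** raw_slot() { return &handle_; }
private:
    RawEvent* handle_;
};

// Assumed behaviour: the new handle overwrites whatever the wrapper held,
// and the old handle is never released.
void toy_enqueue(ToyEvent* out) { *out->raw_slot() = raw_create(); }

// Reuses one wrapper for n enqueues; returns how many handles stay alive.
int leaked_when_reusing_one_event(int n) {
    int before = g_live_handles;
    {
        ToyEvent current;
        for (int i = 0; i < n; ++i) toy_enqueue(&current);
    }  // only the last handle is released here
    return g_live_handles - before;
}

// Uses a fresh wrapper per enqueue; returns how many handles stay alive.
int leaked_with_fresh_events(int n) {
    int before = g_live_handles;
    {
        std::vector<ToyEvent> all;
        for (int i = 0; i < n; ++i) {
            all.push_back(ToyEvent());
            toy_enqueue(&all.back());
        }
    }  // every wrapper releases its own handle here
    return g_live_handles - before;
}
```

In this model, leaked_when_reusing_one_event(4) leaves 3 handles alive, while leaked_with_fresh_events(4) leaves none. Whether a real driver behaves this way is an assumption, but it matches the symptom of events piling up only when one cl::Event object is reused.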

While I understand why things are architected this way, I think the documentation should make more noise about such silent failures in the cl::Event destructor (and possibly in the destructors of the other object wrappers), and it should also describe the fix I found. Changing the above code to:

Code :
std::vector<cl::Event> all_events;
std::vector<cl::Event> prerequisites;
// Some upload events happen and fill all_events and then prerequisites...
all_events.push_back(cl::Event());
command_queue.enqueueNDRangeKernel(kernel1, ..., &prerequisites, &all_events.back());
prerequisites.clear();
prerequisites.push_back(all_events.back());
all_events.push_back(cl::Event());
command_queue.enqueueNDRangeKernel(kernel2, ..., &prerequisites, &all_events.back());
prerequisites.clear();
prerequisites.push_back(all_events.back());
all_events.push_back(cl::Event());
command_queue.enqueueNDRangeKernel(kernel3, ..., NULL, &all_events.back());
prerequisites.push_back(all_events.back());
all_events.push_back(cl::Event());
command_queue.enqueueNDRangeKernel(kernel4, ..., &prerequisites, &all_events.back());
prerequisites.clear();
cl::Event::waitForEvents(all_events);
// Download data from device and finish...

stops all the leaks, because every destructor now runs only after its event has completed on the GPU.
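The repeated "push a fresh event, then pass a pointer to it" step can be wrapped in a small helper. This is a minimal sketch: fresh_event is a hypothetical name, and it is templated on the event type so that it compiles without the OpenCL headers (DummyEvent is only a stand-in for cl::Event).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: appends a default-constructed event to all_events
// and returns a pointer to it, for use as the enqueue call's output event.
// The pointer is only valid until the next push_back on the same vector,
// so it should be passed to the enqueue call immediately and not stored.
template <typename EventT>
EventT* fresh_event(std::vector<EventT>& all_events) {
    all_events.push_back(EventT());
    return &all_events.back();
}

// Stand-in for cl::Event so the sketch is self-contained.
struct DummyEvent {
    int id = 0;
};
```

With cl::Event as the element type, a call would then read command_queue.enqueueNDRangeKernel(kernel1, ..., &prerequisites, fresh_event(all_events)); followed by prerequisites.push_back(all_events.back());.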