Releases: tzcnt/TooManyCooks

v1.6.1 - bugfixes

14 Jun 23:16

Fixes are mostly for new v1.6 features, with a couple tmc::channel fixes that go back to v1.0.

  • 🚩 major bug fix
  • 🐞 minor bug fix

In v1.0.2 / v1.1.2 / v1.2.2 / v1.3.3 / v1.4.1 / v1.5.1:

  • 🚩 channel bugfixes (#240)
  • 🐞 fix post_bulk() index comparison wraparound (#243)

Also in v1.6.1:

  • 🚩 fix possible lost wakeup in qu_mpsc_bounded::close() (#242)
  • 🚩 empty() now returns false on a closed-and-drained queue (#246)
  • 🚩 fix incorrect wake/wait counts on Linux in ex_cpu_st (#249)
  • 🐞 fix assert in qu_spsc_unbounded::try_reclaim_blocks() when BlockSize == 1 (#245)
  • 🐞 chase-lev cleanup (#248)

v1.6.0 - more queues and faster executors

07 Jun 02:31
8e2c40f

New Async Queues

This release includes 4 new async queues.

These queues all support suspending consumption; consumers can wait for data to become available with co_await q.pull() or poll for data with q.try_pull(). Bounded queues also support suspending production; producers can wait for a free slot in the queue with co_await q.push(value). These queues do not require the use of a hazard pointer/token like tmc::channel.

They also support purely zero-copy operation. All enqueue operations emplace-construct values directly in the queue storage. All dequeue operations return a zc_scope, which is a scoped reference to an object in the queue storage. You can choose to copy or move the object out of the queue from this scope, or to access it in-place. When the scope is released (before the next pull), the queued object is destroyed and the consumer pointer is advanced.

Enhancements

The following APIs will now wake waiting coroutines in FIFO order (previously LIFO):

  • tmc::auto_reset_event::set() / co_set()
  • tmc::mutex::unlock() / co_unlock()
  • tmc::semaphore::release() / co_release()

Performance Improvements

  • tmc::ex_cpu external producers can safely skip waking if executor threads are running
  • tmc::ex_cpu internal work-stealing queue is now a Chase-Lev deque
  • tmc::ex_cpu_st uses a more efficient external ingest queue (a modified tmc::qu_mpsc_unbounded)
  • tmc::ex_cpu_st uses a more efficient internal stack
  • tmc::ex_braid is built on tmc::qu_mpsc_unbounded instead of tmc::channel

Breaking Changes

The build configuration TMC_WORK_ITEM=FUNC is no longer supported. If you are using this configuration, you will receive a compile error and be instructed to replace it with TMC_WORK_ITEM=FUNCORO instead. This is because the Chase-Lev deque requires its work item to be trivially copyable, which std::function is not.
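
The trivially-copyable requirement can be checked with the standard type traits. The sketch below compares std::function against a plain function-pointer work item; raw_work_item and increment() are hypothetical stand-ins for illustration, not TMC's actual work item type.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>

// Hypothetical pointer-based work item (not TMC's actual type).
struct raw_work_item {
  void (*fn)(void*) = nullptr;
  void* arg = nullptr;

  void operator()() const {
    assert(fn != nullptr);
    fn(arg);
  }
};

// Example payload: increments the int that arg points to.
inline void increment(void* p) { ++*static_cast<int*>(p); }

// A pair of raw pointers can be copied as plain bytes...
static_assert(std::is_trivially_copyable_v<raw_work_item>);
// ...but std::function has a user-provided copy constructor, so it cannot.
static_assert(!std::is_trivially_copyable_v<std::function<void()>>);
```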

v1.5.0 - the build system edition

10 May 20:15

Table of Contents

Build System Changes

New Features

Improvements

Build System Changes

Summary: all of the build system changes are backward-compatible in the short term, but you should take action to migrate your projects as directed to avoid breakage in future releases.

TMC becomes header-only

TooManyCooks has become a header-only library by default, with all function definitions inline.

The previously standard build mode which embeds definitions for non-template functions into the file with #define TMC_IMPL is now available as an optional configuration that can be specified by defining TMC_STANDALONE_COMPILATION. If you wish to build the library into a DLL, you can also define TMC_WINDOWS_DLL which will apply __declspec(dllexport) to the implementation file, and __declspec(dllimport) to all consuming files.

This change is backward-compatible with existing builds from pre-v1.5 projects that don't set TMC_STANDALONE_COMPILATION. Those builds will silently become header-only builds, and the #define TMC_IMPL file will silently become empty.

To complete the migration from v1.4 to v1.5, either remove the implementation macro file, or set TMC_STANDALONE_COMPILATION to restore its functionality.

tmc-asio is now included with TMC

The legacy tmc-asio repo has been marked as obsolete. tmc-asio is now vendored directly within the include/tmc/asio/ folder.

This change is backward-compatible with existing builds from pre-v1.5 projects that import the legacy tmc-asio repo, since there have been no changes to tmc-asio in this release. Your project may find tmc-asio headers from both locations, but the headers are identical so there will be no immediate impact.

To complete the migration from v1.4 to v1.5, remove the legacy tmc-asio repo reference from your project.

TMC offers a CMake INTERFACE target with configuration options

A CMake INTERFACE target is now available in the TooManyCooks library. This target exposes the preprocessor definition-based configurations (documented here: https://fleetcode.com/oss/tmc/docs/v1.5/build_flags.html) as CMake options. The mapping from options to definitions is in tmc-options.cmake.

This change is backward-compatible with existing builds from pre-v1.5 projects that add the include/ directory and set configuration macros directly.

To complete the migration from v1.4 to v1.5 in a CMake consumer project, replace

target_include_directories(${target_name} PRIVATE ${TMC_INCLUDE_PATH})

with

target_link_libraries(${target_name} PRIVATE TooManyCooks::TooManyCooks)

and replace any manual compile-definition setting with setting the appropriate CMake options before importing TooManyCooks as documented in the examples repo. Note that the new CMake target sets TMC_USE_HWLOC=ON by default, so you will need to disable that option explicitly if you don't want hwloc integration.

If your project doesn't use CMake, no action is needed - you can continue to import the library and set the preprocessor configurations as before.

TMC + Conan integration

conanfile.py has been added for conan integration. The README explains the usage. The conan recipe exposes all of the same configuration options as the CMake target, but in a build-system agnostic manner.

TMC + vcpkg integration

A vcpkg ports folder has been added. The README explains the usage.

New Features

Added PostRunHook to ex_cpu and ex_cpu_st

Thanks to @fede-vaccaro for this change! These executors now support calling set_thread_post_run_hook() to configure a function that will run after a thread has checked all of its queues and failed to find work, before it enters the spinning/sleeping phase. One usage of this feature is to allow users to add custom work queues or poll for I/O in a user-configurable manner.

Added documentation entries for this method on ex_cpu and ex_cpu_st.

Improvements

Miscellaneous

  • Remove tmc::task::operator bool entirely in #211
  • make fork_group and spawn_group [[nodiscard]] in #212
  • Added a requirement on task::return_value in #215 (Thanks to @DrizztDoUrden for this change!)
  • capture *_group continuation_executor at the await point in #216
  • tmc::detail::qu_mpsc becomes hazard pointer-free in #221

tmc::channel close() wakes waiting consumers

  • channel: wake waiting consumers in close() instead of drain() in #220

This makes close() the preferred method to terminate a channel, and marks drain() + drain_wait() as deprecated. If you need to wait for all consumers of a channel to complete, it is recommended to call close() and then use a different method of tracking such as a fork_group or barrier.

Updated the documentation entries for chan_tok::close() and chan_tok::drain().

v1.4.0 - the toolbox edition

02 Feb 03:36

This release includes several features to enhance the flexibility of the library.

Pre-Announcement: async stack traces

I'm going on a side quest to get async stack traces working in the debugger. This is still an area of active research, and was only made possible recently by the addition of ScriptedFrameProviders in the LLDB 22 pre-release. You can follow the development and read more in the backtrace branch of tmc-examples. Currently, there is a "fully working" version of the debugger script, including IDE integration, for at least one configuration (VS Code + LLDB DAP + LLDB 22 + Clang/GCC + x86). Contributions are welcome.

New example: pipeline.cpp

This shows a configurable generic parallel processing pipeline. Processing stage functions can be coroutines or regular functions. It makes use of the traits and zero-copy channel functionality introduced in v1.4. pipeline.cpp demonstrates the usage, and there are two generic header implementations: pipeline.hpp and pipeline_fifo.hpp.

(Docs Link) New header: tmc/traits.hpp

This header contains 2 logical groups of concepts / type traits:

  • traits to allow users to distinguish between values, awaitables, and packaged tasks
  • traits to allow users to distinguish between tmc::task and other kinds of functors

This enables users to write generic handlers for various scenarios, similar to those employed internally within the library. For example, here is a generic handler that can accept any value, awaitable, or packaged task:

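// handler must itself be a coroutine (here returning tmc::task<void>),
// because one of its branches uses co_await.
template <typename T>
tmc::task<void> handler(T&& t) {
  if constexpr (tmc::traits::is_awaitable<T>) {
    process(co_await std::forward<T>(t));
  } else if constexpr (tmc::traits::is_callable<T>) {
    process(std::forward<T>(t)());
  } else {
    process(std::forward<T>(t));
  }
}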

(Docs Link) New capability: tmc::channel zero-copy

A new awaitable function tmc::chan_tok::pull_zc() has been added. This behaves similarly to pull() but returns a scoped object that holds a reference to the storage slot directly in the channel. When the scoped object's lifetime ends, the referenced object is destroyed and the channel slot can be reused.

Additionally, the tmc::chan_tok::post() and tmc::chan_tok::push() functions have been rewritten to use forwarding construction rather than copy construction.

This combination of features makes it possible to use tmc::channel in a completely zero-copy, zero-move manner (with a type that doesn't have a default constructor, copy constructor, or move constructor).

(Docs Link) New free function: tmc::spawn_clang()

This is the same as tmc::spawn() but it enables HALO so that you can customize the executor and priority without incurring an allocation. To facilitate this, it takes the executor and priority as optional arguments rather than using the fluent builder pattern.

(Docs Link) New compile option: TMC_NODISCARD_AWAIT

If the user defines TMC_NODISCARD_AWAIT then every await_resume() that returns non-void will also be marked [[nodiscard]].

This has to be done in the library since there's no way to specify it otherwise. Setting [[nodiscard]] on the return type doesn't work and neither does setting it on the coroutine task function declaration.

As a side note, all of the awaitables themselves are already nodiscard. This just adds the option to make the result of the co_await expression nodiscard as well.

New header: tmc/version.hpp

This defines the TMC_VERSION, TMC_VERSION_MAJOR, TMC_VERSION_MINOR, and TMC_VERSION_PATCH macros. This has also been backported to prior version branches and can be used to check for feature availability.
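
A feature check might look like the sketch below. It assumes that TMC_VERSION_MAJOR and TMC_VERSION_MINOR expand to plain integer constants; version_at_least() and has_pull_zc are hypothetical names, not part of the library. The header is included only if it can be found, so the snippet also compiles when TMC is not on the include path.

```cpp
#include <cassert>

#if __has_include(<tmc/version.hpp>)
#include <tmc/version.hpp>
#endif

// Hypothetical helper: true if (major, minor) is at least (want_major, want_minor).
constexpr bool version_at_least(int major, int minor, int want_major,
                                int want_minor) {
  return major > want_major || (major == want_major && minor >= want_minor);
}

// Assumption: the version macros expand to plain integer constants.
#if defined(TMC_VERSION_MAJOR) && defined(TMC_VERSION_MINOR)
// pull_zc() was introduced in v1.4.
constexpr bool has_pull_zc =
  version_at_least(TMC_VERSION_MAJOR, TMC_VERSION_MINOR, 1, 4);
#else
// Header not found, or this copy of the library predates version.hpp.
constexpr bool has_pull_zc = false;
#endif
```

Code that depends on a newer feature can then branch on has_pull_zc with if constexpr, or use the macros directly in an #if.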

Breaking changes

tmc::coro_functor has been made trivially copyable/destructible. Rather than destroying any owned functor in its destructor, it now destroys the owned functor after it is executed. This makes it a delegate that must be executed exactly once. Since this behavior is unusual and possibly unsuitable for general purpose consumption, it has been moved into the tmc::detail namespace.

This is the work item type used internally when TMC_WORK_ITEM=FUNCORO is defined. However, it's not directly required in any public API, so this is only a breaking change if you were using this type directly in your code. Since the namespace has changed, if you are affected by this, it will be a compile-time error.

The purpose of this change is to prepare for a future transition to a Chase-Lev deque, which avoids reclamation synchronization when resizing by leaving older task buffers alive. This allows multiple redundant copies to point to the same functor, as long as only one of them is executed.
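
The "execute exactly once" contract can be sketched with the standard library alone. once_delegate below is a hypothetical type that only mirrors the behavior described above (trivially copyable, cleanup happens after execution); it is not the actual tmc::detail::coro_functor.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical run-once delegate; not TMC's coro_functor.
class once_delegate {
  void (*invoke_)(void*) = nullptr;
  void* obj_ = nullptr;

public:
  once_delegate() = default;

  // Stores a heap-allocated copy of the functor behind a type-erased pointer.
  template <typename F> static once_delegate make(F f) {
    once_delegate d;
    d.obj_ = new F(std::move(f));
    d.invoke_ = [](void* p) {
      F* typed = static_cast<F*>(p);
      (*typed)();
      delete typed; // destroyed after execution, not in a destructor
    };
    return d;
  }

  // Must be called exactly once across ALL copies of this delegate:
  // zero calls leak the functor, two calls use it after it was freed.
  void operator()() const {
    assert(invoke_ != nullptr);
    invoke_(obj_);
  }
};

// Copies are plain byte copies and nothing runs on destruction, so stale
// copies left behind in an old buffer are harmless as long as they never run.
static_assert(std::is_trivially_copyable_v<once_delegate>);
static_assert(std::is_trivially_destructible_v<once_delegate>);
```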

v1.3.2 - the bugfix backports edition

02 Feb 02:17

The objective of this release is to establish a versioning strategy that allows users to choose a particular minor version (e.g. v1.3) and receive patch fixes to that version without any potential regressions that might be introduced by new features. To this end:

  • a full review was completed on all files in the repository
  • fixes were applied to main
  • branches were created for all minor versions released so far: v1.0, v1.1, v1.2, v1.3
  • fixes were backported to each branch where they apply
  • new tags v1.0.1, v1.1.1, v1.2.1, v1.3.2 were created which include the appropriate fixes

Users now have the option to pin their references to the branch (v1.3) or choose the latest patch tag (v1.3.2). In either case, only non-breaking fixes will be applied to these tagged branches after their initial release. When a new minor release is created, the branch will be v1.4 and the tag will be v1.4.0 so they don't conflict.

New Feature: version.hpp

A version.hpp file which defines TMC_VERSION_ macros has been added (backported to all branches) for users to identify the availability of certain features. The development branch (main) will identify as the next unreleased version (currently v1.4).

Fixes Applied

Issues have been categorized by severity:

  • 🚩 major bug fix
  • 🐞 minor bug fix
  • 🕵️ theoretical correctness fix
  • 📝 documentation improvements

Fixes are cumulative, so any fix that was applied to v1.0 is also in v1.1 and higher.

In v1.0.1:

  • 🚩 [v1.0 backport] strengthen atomics in mutex/semaphore (#176)
  • 🐞 [v1.0 backport] Add check for Count == 0 in post_bulk() / post_bulk_waitable() (#198)
  • 🐞 [v1.0 backport] fix missing return with channel::set_reuse_blocks and EmbedFirstBlock = true
  • 🐞 [v1.0 backport] fix: missing comparison in manual_reset_event::co_set (#180)
  • 🕵️ [v1.0 backport] preload waiter list to avoid potential race condition (#182)
  • 🕵️ [v1.0 backport] strengthen atomic loads in atomic_condvar (#181)
  • 🕵️ [v1.0 backport] tweak barrier/latch constructor cast order (#179)
  • 🕵️ [v1.0 backport] remove unused coroutine_traits from task_unsafe (#193)
  • 🕵️ [v1.0 backport] add missing include to work_item.hpp
  • 📝 [v1.0 backport] document that executor teardown() and destructor must be called from external thread
  • 📝 [v1.0 backport] update iter_adapter doc comment
  • 📝 [v1.0 backport] fix resume_on() awaitable customizer doc comment
  • 📝 [v1.0 backport] document braid cannot be held across a suspend point
  • 📝 [v1.0 backport] better document spawn_many invariants
  • 📝 [v1.0 backport] better document barrier invariants

Also in v1.1.1:

  • 🚩 [v1.1 backport] fix potential use-after-free in tmc::task_promise::final_suspend with HALO (#201)
  • 🚩 [v1.1 backport] add missing executor/priority dispatch to spawn_group (#171)
  • 🐞 [v1.1 backport] fix awaitable_traits::result_type for spawn_group with dynamic size (#185)
  • 🐞 [v1.1 backport] use mixins with resume_on / enter / exit (#196)
  • 📝 [v1.1 backport] update spawn_group / fork_group::reset() doc comments
  • 📝 [v1.1 backport] document that spawn_group::add*() and fork_group::fork*() are not thread-safe (#186)

Also in v1.2.1:

  • 🕵️ [v1.2 backport] strengthen atomic loads in bitmap_object_pool

Also in v1.3.2:

  • 🚩 [v1.3 backport] fix: remove pr_empty flag in ex_cpu queue which was causing lost wakeups (#175)
  • 🚩 [v1.3 backport] fix potential leak in coro_functor move assignment (#191)
  • 🐞 [v1.3 backport] add cpu_kind operator| (#199)
  • 🐞 [v1.3 backport] fix: Split inboxes for threads in the same group that have different allowed priorities (#177)
  • 📝 [v1.3 backport] topology: make the methods of tmc::topology::cpu_topology const (#187)

v1.3.1

12 Jan 06:42
8d65e6c

Two hotfixes:

  • #170 - Fix TMC_DEBUG_THREAD_CREATION interaction with hwloc DLL on Windows.
  • #169 - Increased cacheline padding / alignas on Apple M series CPUs to 128 bytes to reduce false sharing. Results from my M2 MacBook Air have been added to the benchmarks chart.

v1.3.0 - the hwloc edition

08 Jan 05:55

This release dramatically enhances the runtime hardware detection and thread configuration capabilities of TooManyCooks. This makes it possible to write applications that will scale effortlessly on a variety of systems, including bare-metal monolithic, hybrid, or chiplet architecture CPUs, many-core/NUMA machines, or containers/virtualized environments.

There are several new examples demonstrating these capabilities, located here:

>> Examples Link <<

Enhancements to hwloc integration (with TMC_USE_HWLOC)

Prior State in v1.2 - ex_cpu Work Stealing Groups (Automatic)

The following was used internally to optimize work stealing, but was not directly visible to the user:
The number of shared L3 caches on the system is detected and thread groups are created according to those L3 caches. If an executor contains multiple such groups, threads prefer to steal work from other threads in their group before looking outside their group to steal. This is most effective on AMD Zen chiplet architectures, which may have many such caches (one per CCD/chiplet) and high latency for inter-chiplet access.

Thread affinity is set so that each thread may run on any core inside its cache group. This prevents expensive cross-cache thread migrations, while allowing some flexibility in scheduling, if there are other threads running on the same system.

(Docs Link) ex_cpu Work Stealing Groups Improvements

  • Different CPU kinds (Performance or Efficiency cores on hybrid CPUs) are detected and will also be treated as independent groups. If an executor contains multiple CPU kinds, threads prefer to steal from other threads with the same CPU kind before stealing from threads running on a different CPU kind.
  • Caches of any level that are shared among multiple cores can create a group. For example, Apple M processors only expose L2 caches.
  • Irregular cache hierarchies are handled as well. For example, the Intel 13th gen will have an L3 cache group for the P-cores, and an L2 cache group for each cluster of 4 E-cores.

(Docs Link) New Header tmc/topology.hpp

  • tmc::topology::query() can be called to query the CPU topology and return a view optimized for TMC usage that includes NUMA nodes, cache groups and core counts. It also exposes information about CPU kinds, the number of CPUs of each kind, and the SMT level of each cache group (since often only P-cores have SMT). This disambiguates between P-cores, E-cores, and Low-Power E-cores (as seen on latest gen Intel laptop chips).
  • New types cpu_topology, core_group, cpu_kind exposed by the topology object
  • New types thread_pinning_level, thread_packing_strategy, thread_info used by ex_cpu
  • New type topology_filter used by multiple executors to control where threads are allocated
  • New function pin_thread to allow users to match external thread affinity to executor affinity

(Docs Link) New Method on ex_cpu/ex_cpu_st/ex_asio

  • add_partition() allows you to specify which physical cores an executor is allowed to run on (like the taskset command for a single executor). The input to this function is a tmc::topology::topology_filter which can be constructed with information retrieved from the topology query, to specify a specific set of cores, cache groups, or NUMA nodes.

(Docs Link) New Methods on ex_cpu

  • fill_thread_occupancy() fills SMT levels individually for all cores, with awareness of their CPU kind
  • set_thread_init_hook() / set_thread_teardown_hook() have new overloads that receive a tmc::topology::thread_info struct with info about the thread's group and CPU kind.
  • set_thread_pinning_level() defaults to GROUP, but allows pinning to CORE (for benchmarks)
  • set_thread_packing_strategy() controls how threads should be allocated when set_thread_count() is less than the whole system
  • set_work_stealing_strategy() controls the work stealing matrix type

(Docs Link) New Killer Feature: Hybrid Work Steering

For ex_cpu only, add_partition() can be called multiple times to split work between multiple partitions at different priority levels. This can be called with any partition type, but is probably most useful when used to split P- and E-cores. These priority ranges can be overlapping (as shown below) or non-overlapping.

    tmc::topology::topology_filter p_cores;
    p_cores.set_cpu_kinds(tmc::topology::cpu_kind::PERFORMANCE);
    tmc::topology::topology_filter e_cores;
    e_cores.set_cpu_kinds(tmc::topology::cpu_kind::EFFICIENCY1);

    // P-cores handle high (priority 0) and medium (priority 1) work
    // E-cores handle medium (priority 1) and low (priority 2) work
    // Work stealing between core types can happen for priority 1 work
    tmc::cpu_executor()
      .add_partition(p_cores, 0, 2)
      .add_partition(e_cores, 1, 3)
      .set_priority_count(3)
      .init();

(Docs Link) New Debug Compile Flag

If you define the preprocessor macro TMC_DEBUG_THREAD_CREATION, executors will print information about thread groups, affinities, and work stealing matrices when init() is called.

(Docs Link) Container (cgroups) CPU Quota Detection for ex_cpu

  • ex_cpu will automatically detect if Linux cgroups (v1 or v2) CPU quotas have been configured for the application. If so, it will create a default number of threads equal to the quota, rounded down, with a minimum of 1. This means that if you run with docker run --cpus=2 then 2 threads will be created. This feature does not require TMC_USE_HWLOC and is always active.
  • If TMC_USE_HWLOC is enabled, hwloc can detect if a specific cpuset was allocated. That is, if you run with docker run --cpuset-cpus=0,1 then 2 threads will be created, and the usual optimizations based on CPU cache groupings will apply.
  • These features also work with Kubernetes, as long as the underlying containerization is implemented using Linux cgroups.
  • This can be overridden by calling set_thread_count().
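
The quota-to-thread-count rule can be sketched as follows for the cgroup v2 case, where the cpu.max file contains either "<quota> <period>" or "max <period>". This is a rough standard-library illustration, not TMC's detection code; threads_from_cpu_max() is a hypothetical name, and reading the file itself (and the cgroup v1 files) is left out.

```cpp
#include <cassert>
#include <exception>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: converts the text of a cgroup v2 cpu.max file into a
// default thread count. Returns nullopt if no quota is set or the text is
// malformed, in which case the caller would fall back to the core count.
inline std::optional<unsigned> threads_from_cpu_max(const std::string& text) {
  std::istringstream in(text);
  std::string quota;
  long long period = 0;
  if (!(in >> quota >> period) || quota == "max" || period <= 0) {
    return std::nullopt;
  }
  long long quota_us = 0;
  try {
    quota_us = std::stoll(quota);
  } catch (const std::exception&) {
    return std::nullopt;
  }
  if (quota_us <= 0) {
    return std::nullopt;
  }
  long long threads = quota_us / period; // integer division rounds down
  if (threads < 1) {
    threads = 1; // never create fewer than one thread
  }
  assert(threads >= 1);
  return static_cast<unsigned>(threads);
}
```

For example, a quota of 250000 with a period of 100000 (2.5 CPUs) yields 2 threads, and a quota of 50000 (half a CPU) still yields 1.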

(Docs Link) Unlimited Threads Support

By default, ex_cpu uses machine-word-sized bitmaps for thread state tracking. This is highly efficient, but imposes a limit of 32 or 64 threads, based on system word size.

In v1.3, if you define the preprocessor macro TMC_MORE_THREADS, ex_cpu will support an unlimited number of threads. This uses a dynamic bitmap, which does have a small additional performance cost.
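
To show where the word-size limit comes from, here is a simplified sketch of tracking sleeping threads in a single atomic word. sleeper_bitmap and its methods are hypothetical and much simpler than what ex_cpu actually does; the point is only that one bit per thread caps the thread count at the number of bits in the word.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical sleeper tracker: bit i is set while thread i is asleep.
class sleeper_bitmap {
  std::atomic<std::size_t> bits_{0};

public:
  // One bit per thread, so capacity is the word size in bits (32 or 64).
  static constexpr std::size_t capacity =
    std::numeric_limits<std::size_t>::digits;

  void mark_sleeping(std::size_t thread) {
    assert(thread < capacity);
    bits_.fetch_or(std::size_t{1} << thread, std::memory_order_acq_rel);
  }

  // Claims the lowest-indexed sleeping thread by clearing its bit.
  // Returns capacity if no thread is asleep.
  std::size_t claim_one_sleeper() {
    std::size_t seen = bits_.load(std::memory_order_acquire);
    while (seen != 0) {
      // Find the lowest set bit (C++20 code could use std::countr_zero).
      std::size_t idx = 0;
      while (((seen >> idx) & 1u) == 0) {
        ++idx;
      }
      std::size_t want = seen & ~(std::size_t{1} << idx);
      // On failure, seen is reloaded with the current value and we retry.
      if (bits_.compare_exchange_weak(seen, want, std::memory_order_acq_rel,
                                      std::memory_order_acquire)) {
        return idx;
      }
    }
    return capacity;
  }
};
```

A dynamic bitmap removes the cap by spreading the bits over several words, at the cost of scanning more than one word per operation.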

(Docs Link) New Executor: ex_manual_st

This is an executor that doesn't own any threads. Work posted to this executor will be queued, but not executed until it is polled by calling a run_*() function, during which the calling thread will execute work on behalf of the executor. This can be used to integrate with an external event loop, e.g. a game engine's main loop, and to poll for continuations at a specific time, without needing synchronization with other elements of the loop. Although any number of threads can post work to it simultaneously, only one thread should call run_*() at any given time.
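
The polling model can be illustrated with a toy executor built from a mutex and a deque. manual_executor, post(), and run_all() here are hypothetical names in a standard-library sketch, not the ex_manual_st API, and it runs plain callables rather than coroutines.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical manually-polled executor; not TMC's ex_manual_st.
class manual_executor {
  std::mutex mutex_;
  std::deque<std::function<void()>> queue_;

public:
  // Safe to call from any number of threads at once.
  void post(std::function<void()> work) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push_back(std::move(work));
  }

  // Runs everything that was queued at the time of the call, on the calling
  // thread, and returns how many items ran. Work posted while running is
  // left for the next call. Only one thread should call this at a time.
  std::size_t run_all() {
    std::deque<std::function<void()>> batch;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      batch.swap(queue_);
    }
    for (auto& work : batch) {
      work(); // the lock is not held here, so work may call post()
    }
    return batch.size();
  }
};
```

In a game loop, this kind of executor would be polled once per frame at a fixed point, so posted work always observes a consistent frame state.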

(Docs Link) New Method on ex_cpu/ex_cpu_st

  • set_spins() controls how many times executor threads spin looking for work before going to sleep

Optimizations

  • Optimized ex_cpu thread sleep/wake logic for the most common scenario - when all threads are working and submitting work within their own priority group.
  • Optimized ex_cpu enqueue/dequeue logic for the most common scenario - when a thread is pushing and popping to its own highest priority queue.
  • Simplified thread sleep/wake logic overall by preferring to wake threads starting at index 0. In addition to reducing the latency of the thread waking calculation, this also has the effect of reducing data migrations, and making it easier for the OS to schedule external threads efficiently.
  • Dynamically size the std::atomic::wait() type used to notify threads that there is more work, based on the target platform. On Linux this type should optimally be 4 bytes, but on Windows it should be 8. On macOS it doesn't matter.

Removed Footguns

  • Made tmc::task::operator bool() explicit. Having this be implicit allowed a task to be converted to an int...
  • Removed tmc::task::done(). This was originally provided for compatibility with the std::coroutine_handle API, but was never useful. Since tmc::task destroys itself on completion, this could never safely return true. The only way (currently) to check if a tmc::task is done is by making use of an awaitable.

Breaking Changes

Executor member functions named task_enter_context() and the traits concept requirement tmc::executor_traits::task_enter_context() have been renamed to dispatch() instead. This makes the naming consistent with asio::dispatch which has the same functionality - resume the work inline if running on the same executor, or post it if coming from a different executor.

This is only a breaking change ...

v1.2.0 - the warning-free edition

23 Nov 18:15

New Features

  • tmc::ex_cpu_st, an explicitly single-threaded executor. Behaves the same as ex_cpu with .set_thread_count(1), but has better round-trip latency, and better internal execution performance, since it doesn't need internal synchronization like a multi-threaded executor. 📄 ex_cpu_st documentation
  • tmc::channel::try_pull(), a new function that allows you to poll a channel for data without suspending if it is empty. 📄 try_pull documentation

Enhancements

The library, examples, and tests now build without warnings using pedantic warnings settings on all supported compilers and OSes.

The CI builds now enable pedantic warnings and set -Werror to ensure the project remains clean going forward.

v1.1.0 - the HALO edition

23 Oct 15:48
62ab8d8

Clang HALO attributes

Heap Allocation eLision Optimization (HALO) is an optimization technique applied by the compiler that allows child task allocations to be combined into the parent’s allocation, eliminating per-task heap allocations for improved performance.

In practice, HALO is often not applied automatically by compilers. However, the Clang compiler, starting with Clang 20, offers some additional attributes [[clang::coro_await_elidable]] and [[clang::coro_await_elidable_argument]] which can be used as a hint to the compiler that it should apply this optimization. TMC now provides these attributes for several types and functions. On non-Clang compilers or Clang versions prior to 20, these functions are safe to use, but provide no additional optimization.

See the documentation for a full primer: https://fleetcode.com/oss/tmc/docs/v1.1/halo/index.html

I will say that I'm not super happy with the final API; the restrictions imposed by Clang's requirements to actually apply HALO via these attributes resulted in a kind of janky solution for forking. And having to introduce additional functions with special rules on a per-compiler basis is not a great experience for users. But, this is an emerging field of compiler development, so it's the best we have access to at the moment.

tmc::fork_group

fork_group is a new type that allows you to initiate individual awaitables immediately, and then await them all at a later time. Multiple different types of awaitables can be dispatched, on different executors/priorities, as long as they all share the same Result type. It offers substantial new flexibility that was not possible previously. For example, the asio_http_server example is now able to track any number of concurrent handler invocations (happening at different times as web requests come in), and ensure they are all finished before exiting, with no memory overhead - only an atomic inc/dec pair per operation.

See the documentation for a full primer: https://fleetcode.com/oss/tmc/docs/v1.1/awaitables/fork_group.html
It also offers a HALO-attributed fork_clang() function: https://fleetcode.com/oss/tmc/docs/v1.1/halo/fork_group.html

tmc::spawn_group

spawn_group is a new type that functions similarly to spawn_many(), but rather than requiring an iterator parameter, you can just call add() on the group to append awaitables and then dispatch + await all of them using operator co_await. If you are familiar with Intel TBB, it behaves similarly to tbb::task_group.

See the documentation for a full primer: https://fleetcode.com/oss/tmc/docs/v1.1/awaitables/spawn_group.html
It also offers a HALO-attributed add_clang() function: https://fleetcode.com/oss/tmc/docs/v1.1/halo/spawn_group.html

Debug Allocation Counter

It can be tricky to tell if HALO is actually working, since the compiler doesn't give any feedback. For this purpose, an optional counter has been added which you can use to profile the number of tmc::task allocations in your own code.

See the documentation for a full primer: https://fleetcode.com/oss/tmc/docs/v1.1/halo/index.html#how-can-you-tell-if-halo-is-working

On AI

This is the first release that was developed with the assistance of AI tools. Rest assured, I don't plan to start shipping AI slop any time soon; I remain committed to delivering quality software, with the cleanest possible API, clear documentation, and optimal performance. And I will never ship code that I don't 100% understand. But I also want to ship more features, and faster.

The AI is tremendously fast at generating code, and I already have prototypes for most of the key features of v1.2. At this point, the bottleneck is for me to carefully review and/or rewrite every single line, and obsessively benchmark and profile as always.

I also used AI to generate initial drafts of all the documentation for this release. Although I rewrote everything afterward, I left some of the initial AI-generated structure in - for example the tmc::spawn_group Key Differences, Template Parameters, Result Storage, Imperative Add Interface - that's all AI. It seems like a reasonable inclusion, but there's also the risk of this becoming excessively verbose, as most of this information is documented on the template specializations themselves in the API reference - which is the approach I previously relied on for tmc::spawn_many. Feel free to let me know in the discussions how you feel about this documentation style.

v1.0.0

02 Oct 03:12

Announcing v1.0.0!

This is the first stable release of TooManyCooks. It offers an excellent foundation of performance and features to build on. However, I have quite a few exciting developments planned for the next round of releases, so stay tuned for more.

New Features

  • Added channel::post_bulk() with 3 overloads that accept iterator/count, begin/end iterator pair, or range types. This function is more efficient than posting in a loop when multiple elements need to be submitted.
  • Added channel::new_token() which is identical to the token copy constructor, but can be used to more explicitly indicate that a new handle / hazard pointer is being created.
  • ex_any now implements tmc::detail::executor_traits. You can now use a variable of type tmc::ex_any* to dynamically switch which executor is passed to some functions.

Enhancements

  • When building with TMC_USE_HWLOC, hwloc.h is now only required by the implementation (where TMC_IMPL is defined) and not by the public headers. Thus, you only need to make this file available on the include path for the implementation compilation unit.
  • When calling post_waitable() and passing an unevaluated coroutine ramp function, a static_assert will disallow this probably erroneous behavior. See #113 for more detail.
  • Pointer tagging was removed from the implementations of channel and coro_functor.
  • The entire library is now 32-bit compatible (as a result of the pointer tagging changes).
  • Implemented ex_braid on top of channel instead of on qu_lockfree. This simplifies the implementation and makes the task list linearizable.

Fixes

  • Fixed a race condition when shutting down ex_cpu immediately after posting the last task from an external thread.
  • Fixed a race condition when shutting down ex_braid immediately after posting the last task from an external thread.
  • Fixed an issue with hwloc CPU grouping on systems that expose no L3 caches (such as Apple M processors).

Compatibility

The library now builds and runs without issues on Visual Studio 2026 Insiders (MSVC Build Tools v145) or newer, as this bug was finally fixed by the MSVC team.