According to Simon Willison’s Weblog, it seems to be about 260 tokens per frame, where one frame is sampled from each second of video and each frame is processed the same way as any other image:
"it looks like it really does work by breaking down the video into individual frames and processing each one as an image."
And at the end:
"The image input was 258 tokens, the total token count after the response was 410 tokens—so 152 tokens for the response from the model. Those image tokens pack in a lot of information!"
But these 152 tokens are just the titles and authors of the books. Information about the order, size, colors, textures, etc....
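To make the arithmetic concrete, here is a rough sketch of the token accounting using only the figures quoted above. The function names are hypothetical, and the estimate assumes exactly one frame per second at 258 tokens per frame; it ignores any tokens for audio or the text prompt, so treat it as a lower bound rather than what the API would actually report.

```python
# Figures taken from the quoted blog post.
TOKENS_PER_FRAME = 258   # reported token count for a single image input
FRAMES_PER_SECOND = 1    # assumption: one sampled frame per second of video


def response_tokens(total_tokens: int, input_tokens: int) -> int:
    """Tokens left for the model's reply once the input is subtracted."""
    return total_tokens - input_tokens


def estimate_video_tokens(duration_seconds: int) -> int:
    """Rough frame-token estimate for a video of the given length (hypothetical helper)."""
    return duration_seconds * FRAMES_PER_SECOND * TOKENS_PER_FRAME


print(response_tokens(410, 258))   # 152
print(estimate_video_tokens(60))   # 15480
```

So under these assumptions a one-minute clip costs roughly 15.5k tokens for the frames alone, about a hundred times the 152 tokens the model spent describing a single image.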