Dr Yoshi Gotoh
PhD
School of Computer Science
Lecturer
Student Projects Officer
Member of the Speech and Hearing (SpandH) research group
y.gotoh@sheffield.ac.uk
Regent Court (DCS)
Full contact details
Dr Yoshi Gotoh
School of Computer Science
Regent Court (DCS)
211 Portobello
Sheffield
S1 4DP
- Profile
Yoshi is a lecturer in the Department of Computer Science. He has a first degree in Engineering from the University of Tokyo and a PhD from Brown University.
- Research interests
Yoshi has worked in speech and spoken language processing for many years. His current interests include audio-visual processing, in particular video analysis and video information retrieval.
- Publications
Journal articles
- Graph-based topic models for trajectory clustering in crowd videos. Machine Vision and Applications, 31.
- Generating natural language tags for video information management. Machine Vision and Applications, 28(3-4), 243-265.
- A statistical model for annotating videos with human actions. Pakistan Journal of Statistics, 32(2), 109-123.
- A framework for creating natural language descriptions of video streams. Information Sciences, 303, 61-82.
- A unified spatio-temporal human body region tracking approach to action recognition. Neurocomputing, 161, 56-64.
- Spoken document retrieval based on confusion network with syllable fragments. International Journal of Advanced Robotic Systems, 9.
- On the subjectivity of human-authored summaries. Natural Language Engineering, 15, 193-213.
- Glasgow University at TRECVID 2009. 2009 TREC Video Retrieval Evaluation Notebook Papers.
- A cascaded broadcast news highlighter. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 151-161.
- Information extraction from broadcast news. Philosophical Transactions of the Royal Society A, 358(1769), 1295-1309.
- Topic-based mixture language modelling. Natural Language Engineering, 5(4), 355-375.
- Efficient training algorithms for HMM's using incremental estimation. IEEE Transactions on Speech and Audio Processing, 6(6), 539-548.
- Taggers for parsers. Artificial Intelligence, 85(1-2), 45-57.
- Taggers for parsers. Artificial Intelligence, 84(1-2), 357-357.
- Analysis of LPC/DFT features for an HMM-based alphadigit recognizer. IEEE Signal Processing Letters, 3(4), 103-106.
Conference proceedings
- Ensembling synchronisation-based and face–voice association paradigms for robust active speaker detection in egocentric recordings. Speech and Computer: 27th International Conference, SPECOM 2025, Szeged, Hungary, October 13-15, 2025, Proceedings, Part II, Vol. LNAI 16188 (pp 289-301).
- Improving audiovisual active speaker detection in egocentric recordings with the data-efficient image transformer. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Taipei, Taiwan, 16 December 2023.
- Exploration of verbal descriptions and dynamic indoors environments for people with sight loss. CHI '23: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (pp 110). Hamburg, Germany, 23 April 2023.
- University of Engineering & Technology, Lahore the University of Sheffield at TRECVID 2015: Instance search. 2015 TREC Video Retrieval Evaluation, TRECVID 2015
- The University of Sheffield and University of Engineering & Technology, Lahore at TRECVID 2014: Instance search task. 2014 TREC Video Retrieval Evaluation, TRECVID 2014
- 3D visual speech animation using 2D videos. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp 2367-2371). Brighton, UK, 12 May 2019.
- Graph-based correlated topic model for motion patterns analysis in crowded scenes from tracklets. British Machine Vision Conference 2018, BMVC 2018
- Graph-based correlated topic model for trajectory clustering in crowded videos. 2018 IEEE Winter Conference on Applications of Computer Vision (pp 1029-1037). Lake Tahoe, NV/CA, 12 March 2018.
- Medical image colorization for better visualization and segmentation. Medical Image Understanding and Analysis, Vol. 723 (pp 571-580)
- Natural language descriptions for human activities in video streams. Proceedings of the 10th International Conference on Natural Language Generation (pp 85-94). Santiago de Compostela, Spain, 4 September 2017.
- Natural language descriptions of human activities scenes: Corpus generation and analysis. Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp 39-47)
- Analysis of visemes in the GRID corpus. Abstract of UKspeech
- Overlapped interest and the impact of visual and audio information in the human perception. Abstract of UKspeech
- The University of Sheffield and University of Engineering & Technology, Lahore at TRECVID 2016: Video to text description task. 2016 TREC Video Retrieval Evaluation, TRECVID 2016
- Corpus generation and analysis: incorporating audio data towards curbing missing information. Proceedings of KDWEB
- Describing spatio-temporal relations between object volumes in video streams. AAAI Workshop Technical Report, Vol. WS-15-14 (pp 2-8)
- Manifold matching with application to instance search based on video queries. ICISP. Cherbourg, 30 June 2014.
- Alignment of nearly-repetitive contents in a video stream with manifold embedding. ICASSP. Firenze
- Video clip retrieval by graph matching. ECIR. Amsterdam
- Action recognition: spatio-temporal human body region tracking approach. CAIP - REACTS workshop. York
- Spatio-temporal manifold embedding for nearly-repetitive contents in a video stream. CAIP. York
- Spatio-temporal human body segmentation from video stream. CAIP. York
- The University of Sheffield, Harbin Engineering University and University of Engineering & Technology, Lahore at TRECVID 2013: Instance search & semantic indexing. 2013 TREC Video Retrieval Evaluation, TRECVID 2013
- The University of Sheffield and Harbin Engineering University at TRECVID 2012: Instance Search. TRECVID
- Human focused video description. ICCV - VECTaR workshop. Barcelona
- Video scene classification based on natural language description. ICCV - ARTEMIS workshop. Barcelona
- Towards coherent natural language description of video streams. ICCV - SIG workshop. Barcelona
- Nearly-repetitive video synchronisation using nonlinear manifold embedding. ICASSP. Dallas
- University of Sheffield at TRECVID 2008: Rushes Summarisation and Video Copy Detection. TRECVID
- Shot alignment in pre-production video. MLMI. Utrecht
- University of Sheffield at TRECVID 2007: Shot Boundary Detection and Rushes Summarisation. TRECVID
- Speaker Role Based Structural Classification of Broadcast News Stories. Interspeech 2007: 8th Annual Conference of the International Speech Communication Association, Vols 1-4 (pp 141-144)
- Relative Evaluation of Informativeness in Machine Generated Summaries. Interspeech 2007: 8th Annual Conference of the International Speech Communication Association, Vols 1-4 (pp 145-148)
- Multi-stage compaction approach to broadcast news summarisation. Interspeech. Lisbon
- On the subjectivity of human authored short summaries. ACL Workshop: Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarizati. Ann Arbor
- Maximum entropy segmentation of broadcast news. 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vols 1-5 (pp 1029-1032)
- Decremental feature-based compaction. DUC Workshop. Boston
- From text summarisation to style-specific summarisation for broadcast news. Advances in Information Retrieval, Proceedings, Vol. 2997 (pp 223-237)
- Are extractive text summarisation techniques portable to broadcast news? ASRU '03: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (pp 489-494)
- Exploring the style-technique interaction in extractive summarization of broadcast news. ASRU '03: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (pp 495-500)
- Statistical language modelling. Text- and Speech-Triggered Information Access, Vol. 2705 (pp 78-105)
- Punctuation Annotation Using Statistical Prosody Models. Proceedings of the ISCA Workshop on Prosody in Speech Recognition and Understanding (pp 35-40)
- Sentence boundary detection in broadcast speech transcripts. ISCA ASR Workshop. Paris
- Variable word rate n-grams. 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings, Vols I-VI (pp 1591-1594)
- Integrated transcription and identification of named entities in broadcast speech. Eurospeech. Budapest
- Statistical annotation of named entities in spoken audio. ESCA Workshop: Accessing Information in Spoken Audio. Cambridge
- Named entity tagged language models. ICASSP '99: 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings, Vols I-VI (pp 513-516)
- Document space models using latent semantic analysis. Eurospeech. Rhodes
- Microphone-array speech recognition via incremental MAP training. ICASSP. Atlanta
- Incremental ML estimation of HMM parameters for efficient training. ICASSP. Atlanta
- Incremental MAP estimation of HMMs for efficient training and improved performance. ICASSP. Detroit
- Using MAP estimated parameters to improve HMM speech recognition performance. ICASSP. Adelaide
- Face-Voice Association for Audiovisual Active Speaker Detection in Egocentric Recordings. Proceedings of the European Signal Processing Conference
- Speaker Embedding Informed Audiovisual Active Speaker Detection for Egocentric Recordings. Proceedings of the ... IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
- Natural language descriptions for video streams. V&L Net Workshop. Sheffield, December 2012.
- Spatio-temporal SIFT and its application to human action classification. ECCV - VECTaR workshop. Firenze, October 2012.
- Spatio-temporal video representation with locality-constrained linear coding. ECCV - ARTEMIS workshop. Firenze, October 2012.
- Generating coherent natural language annotations for video streams. ICIP. Orlando, September 2012.
- Natural language descriptions of visual scenes: corpus generation and analysis. EACL workshop. Avignon, April 2012.
- Describing video contents in natural language. EACL workshop. Avignon, April 2012.
- Speaker role based structural classification of broadcast news stories. Interspeech 2007 (pp 2593-2596)
- Relative evaluation of informativeness in machine generated summaries. Interspeech 2007 (pp 1338-1341)
- Grants
- Visual Understanding for Fake Imagery Detect, Innovate UK, 09/2021 - 03/2024, £218,226, as Co-PI
- Multimedia Analysis for Unsupervised Dubbing In Entertainment (MAUDIE), Innovate UK, 04/2018 - 03/2021, £393,115, as Co-PI
- S3L: Statistical Summarization of Spoken Language, EPSRC, 12/2001 - 09/2005, £284,248, as Co-PI
- Professional activities and memberships
Member of the Speech and Hearing research group