Dr Yoshi Gotoh

PhD

School of Computer Science

Lecturer

Student Projects Officer

Member of the Speech and Hearing (SpandH) research group

y.gotoh@sheffield.ac.uk

Regent Court (CS)

Full contact details

Dr Yoshi Gotoh
School of Computer Science
Regent Court (CS)
211 Portobello
Sheffield
S1 4DP

Profile: Yoshi is a lecturer in the Department of Computer Science. He has a first degree in Engineering from the University of Tokyo and a PhD from Brown University.

Research interests: Yoshi has been working in the field of speech and spoken language processing for years. His current interests include audio-visual processing, in particular video analysis and video information retrieval.

Publications

Journal articles

Al Ghamdi M & Gotoh Y (2020) Graph-based topic models for trajectory clustering in crowd videos. Machine Vision and Applications, 31.
Khan MUG & Gotoh Y (2017) Generating natural language tags for video information management. Machine Vision and Applications, 28(3-4), 243-265.
Khan MUG, Nasir A, Riaz O, Gotoh Y & Amiruddin M (2016) A statistical model for annotating videos with human actions. Pakistan Journal of Statistics, 32(2), 109-123.
Khan M, AlHarbi N & Gotoh Y (2015) A framework for creating natural language descriptions of video streams. Information Sciences, 303, 61-82.
Al Harbi N & Gotoh Y (2015) A unified spatio-temporal human body region tracking approach to action recognition. Neurocomputing, 161, 56-64.
Zhang L, Gotoh Y & Khan M (2012) Spoken document retrieval based on confusion network with syllable fragments. International Journal of Advanced Robotic Systems, 9.
Kolluru B & Gotoh Y (2009) On the subjectivity of human-authored summaries. Natural Language Engineering, 15, 193-213.
Punitha P, Misra H, Ren R, Hannah D, Goyal A, Villa R & Jose JM (2009) Glasgow University at TRECVID 2009. 2009 TREC Video Retrieval Evaluation Notebook Papers.
Christensen H, Gotoh Y & Renals S (2008) A cascaded broadcast news highlighter. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 151-161.
Gotoh Y & Renals S (2000) Information extraction from broadcast news. Philosophical Transactions of the Royal Society A, 358(1769), 1295-1309.
Gotoh Y & Renals S (1999) Topic-based mixture language modelling. Natural Language Engineering, 5(4), 355-375.
Gotoh Y, Hochberg MM & Silverman HF (1998) Efficient training algorithms for HMM's using incremental estimation. IEEE Transactions on Speech and Audio Processing, 6(6), 539-548.
Charniak E, Carroll G, Adcock J, Cassandra A, Gotoh Y, Katz J, Littman M & McCanna J (1996) Taggers for parsers. Artificial Intelligence, 85(1-2), 45-57.
Charniak E, Carroll G, Adcock J, Cassandra A, Gotoh Y, Katz J, Littman M & McCann J (1996) Taggers for parsers. Artificial Intelligence, 84(1-2), 357-357.
Mashao D, Gotoh Y & Silverman HF (1996) Analysis of LPC/DFT features for an HMM-based alphadigit recognizer. IEEE Signal Processing Letters, 3(4), 103-106.

Conference proceedings

Clarke J, Gotoh Y & Goetze S (2025) Ensembling synchronisation-based and face–voice association paradigms for robust active speaker detection in egocentric recordings. Speech and Computer: 27th International Conference, SPECOM 2025, Szeged, Hungary, October 13-15, 2025, Proceedings, Part II, Vol. LNAI 16188 (pp 289-301). Szeged, Hungary, 13 October 2025.
Clarke J, Gotoh Y & Goetze S (2024) Improving audiovisual active speaker detection in egocentric recordings with the data-efficient image transformer. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Taipei, Taiwan, 16 December 2023.
Alrashidi A, Cudd P, Abhayaratne C & Gotoh Y (2023) Exploration of verbal descriptions and dynamic indoors environments for people with sight loss. CHI '23: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (pp 110). Hamburg, Germany, 23 April 2023.
Alvi M, Khan MUG, Gotoh Y, Sadiq M & Aslam M (2020) University of Engineering & Technology, Lahore the University of Sheffield at TRECVID 2015: Instance search. 2015 TREC Video Retrieval Evaluation, TRECVID 2015
Amanat S, Khan MUG, Nida N & Gotoh Y (2020) The University of Sheffield and University of Engineering & Technology, Lahore at TRECVID 2014: Instance search task. 2014 TREC Video Retrieval Evaluation, TRECVID 2014
Algadhy R, Gotoh Y & Maddock S (2019) 3D visual speech animation using 2D videos. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp 2367-2371). Brighton, UK, 12 May 2019.
Al Ghamdi M & Gotoh Y (2019) Graph-based correlated topic model for motion patterns analysis in crowded scenes from tracklets. British Machine Vision Conference 2018, BMVC 2018
Al Ghamdi M & Gotoh Y (2018) Graph-based correlated topic model for trajectory clustering in crowded videos. 2018 IEEE Winter Conference on Applications of Computer Vision (pp 1029-1037). Lake Tahoe, NV/CA, 12 March 2018.
Al Ghamdi M & Gotoh Y (2018) Graph-based correlated topic model for motion patterns analysis in crowded scenes from tracklets. British Machine Vision Conference 2018, BMVC 2018
Khan MUG, Gotoh Y & Nida N (2017) Medical image colorization for better visualization and segmentation. Medical Image Understanding and Analysis, Vol. 723 (pp 571-580)
Al Harbi N & Gotoh Y (2017) Natural language descriptions for human activities in video streams. Proceedings of the 10th International Conference on Natural Language Generation (pp 85-94). Santiago de Compostela, Spain, 4 September 2017.
Al Harbi N & Gotoh Y (2016) Natural language descriptions of human activities scenes: Corpus generation and analysis. Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp 39-47)
Algadhy R, Gotoh Y & Maddock S (2016) Analysis of visemes in the GRID corpus. Abstract of UKspeech
Masrani A & Gotoh Y (2016) Overlapped interest and the impact of visual and audio information in the human perception. Abstract of UKspeech
Wahla SQ, Waqar S, Ghani Khan MU & Gotoh Y (2016) The University of Sheffield and University of Engineering & Technology, Lahore at TRECVID 2016: Video to text description task. 2016 TREC Video Retrieval Evaluation, TRECVID 2016
Masrani A & Gotoh Y (2015) Corpus generation and analysis: incorporating audio data towards curbing missing information. Proceedings of KDWEB
Al Harbi N & Gotoh Y (2015) Describing spatio-temporal relations between object volumes in video streams. AAAI Workshop Technical Report, Vol. WS-15-14 (pp 2-8)
Alvi M, Khan MUG, Gotoh Y, Sadiq M & Aslam M (2015) University of Engineering & Technology, Lahore the University of Sheffield at TRECVID 2015: Instance search. 2015 TREC Video Retrieval Evaluation, TRECVID 2015
Al Ghamdi M & Gotoh Y (2014) Manifold matching with application to instance search based on video queries. ICISP. Cherbourg, 30 June 2014.
Al Ghamdi M & Gotoh Y (2014) Alignment of nearly-repetitive contents in a video stream with manifold embedding. ICASSP. Firenze
Al Ghamdi M & Gotoh Y (2014) Video clip retrieval by graph matching. ECIR. Amsterdam
Amanat S, Khan MUG, Nida N & Gotoh Y (2014) The University of Sheffield and University of Engineering & Technology, Lahore at TRECVID 2014: Instance search task. 2014 TREC Video Retrieval Evaluation, TRECVID 2014
Al Harbi N & Gotoh Y (2013) Action recognition: spatio-temporal human body region tracking approach. CAIP - REACTS workshop. York
Al Ghamdi M & Gotoh Y (2013) Spatio-temporal manifold embedding for nearly-repetitive contents in a video stream. CAIP. York
Al Harbi N & Gotoh Y (2013) Spatio-temporal human body segmentation from video stream. CAIP. York
Khan MUG, Bashir K, Shah AA, Zhang L, Gotoh Y, Khan PI & Amiruddin M (2013) The University of Sheffield, Harbin Engineering University and University of Engineering & Technology, Lahore at TRECVID 2013: Instance search & semantic indexing. 2013 TREC Video Retrieval Evaluation, TRECVID 2013
Khan M, Bashir K, Shah A, Zhang L, Gotoh Y, Khan P & Amiruddin M (2013) The University of Sheffield, Harbin Engineering University and University of Engineering & Technology, Lahore at TRECVID 2013: Instance Search & Semantic indexing. TRECVID
Al Ghamdi M, Khan M, Zhang L & Gotoh Y (2012) The University of Sheffield and Harbin Engineering University at TRECVID 2012: Instance Search. TRECVID
Khan M, Zhang L & Gotoh Y (2011) Human focused video description. ICCV - VECTaR workshop. Barcelona
Zhang L, Khan M & Gotoh Y (2011) Video scene classification based on natural language description. ICCV - ARTEMIS workshop. Barcelona
Khan M, Zhang L & Gotoh Y (2011) Towards coherent natural language description of video streams. ICCV - SIG workshop. Barcelona
Chantamunee S & Gotoh Y (2010) Nearly-repetitive video synchronisation using nonlinear manifold embedding. ICASSP. Dallas
Chantamunee S & Gotoh Y (2008) University of Sheffield at TRECVID 2008: Rushes Summarisation and Video Copy Detection. TRECVID
Chantamunee S & Gotoh Y (2008) Shot alignment in pre-production video. MLMI. Utrecht
Chantamunee S & Gotoh Y (2007) University of Sheffield at TRECVID 2007: Shot Boundary Detection and Rushes Summarisation. TRECVID
Kolluru B & Gotoh Y (2007) Speaker Role Based Structural Classification of Broadcast News Stories. Interspeech 2007: 8th Annual Conference of the International Speech Communication Association, Vols 1-4 (pp 141-144)
Kolluru B & Gotoh Y (2007) Relative Evaluation of Informativeness in Machine Generated Summaries. Interspeech 2007: 8th Annual Conference of the International Speech Communication Association, Vols 1-4 (pp 145-148)
Kolluru B, Christensen H & Gotoh Y (2005) Multi-stage compaction approach to broadcast news summarisation. Interspeech. Lisbon
Kolluru B & Gotoh Y (2005) On the subjectivity of human authored short summaries. ACL Workshop: Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarizati. Ann Arbor
Christensen H, Kolluru BK, Gotoh Y & Renals S (2005) Maximum entropy segmentation of broadcast news. 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vols 1-5 (pp 1029-1032)
Kolluru B, Christensen H & Gotoh Y (2004) Decremental feature-based compaction. DUC Workshop. Boston
Christensen H, Kolluru BK, Gotoh Y & Renals S (2004) From text summarisation to style-specific summarisation for broadcast news. Advances in Information Retrieval, Proceedings, Vol. 2997 (pp 223-237)
Christensen H, Gotoh Y, Kolluru B & Renals S (2003) Are extractive text summarisation techniques portable to broadcast news? ASRU'03: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding ASRU '03 (pp 489-494)
Kolluru B, Christensen H, Gotoh Y & Renals S (2003) Exploring the style-technique interaction in extractive summarization of broadcast news. ASRU'03: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding ASRU '03 (pp 495-500)
Gotoh Y & Renals S (2003) Statistical language modelling. Text- and Speech-Triggered Information Access, Vol. 2705 (pp 78-105)
Christensen H, Gotoh Y & Renals S (2001) Punctuation Annotation Using Statistical Prosody Models. Proceedings of the ISCA Workshop on Prosody in Speech Recognition and Understanding (pp 35-40)
Gotoh Y & Renals S (2000) Sentence boundary detection in broadcast speech transcripts. ISCA ASR Workshop. Paris
Gotoh Y & Renals S (2000) Variable word rate n-grams. 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings, Vols I-VI (pp 1591-1594)
Renals S & Gotoh Y (1999) Integrated transcription and identification of named entities in broadcast speech. Eurospeech. Budapest
Gotoh Y & Renals S (1999) Statistical annotation of named entities in spoken audio. ESCA Workshop: Accessing Information in Spoken Audio. Cambridge
Gotoh Y, Renals S & Williams G (1999) Named entity tagged language models. ICASSP '99: 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings Vols I-VI (pp 513-516)
Gotoh Y & Renals S (1997) Document space models using latent semantic analysis. Eurospeech. Rhodes
Adcock J, Gotoh Y, Mashao D & Silverman HF (1996) Microphone-array speech recognition via incremental MAP training. ICASSP. Atlanta
Gotoh Y & Silverman HF (1996) Incremental ML estimation of HMM parameters for efficient training. ICASSP. Atlanta
Gotoh Y, Hochberg MM, Mashao D & Silverman HF (1995) Incremental MAP estimation of HMMs for efficient training and improved performance. ICASSP. Detroit
Gotoh Y, Hochberg MM & Silverman HF (1994) Using MAP estimated parameters to improve HMM speech recognition performance. ICASSP. Adelaide
Clarke J, Gotoh Y & Goetze S (n.d.) Face-Voice Association for Audiovisual Active Speaker Detection in Egocentric Recordings. Proceedings of the European Signal Processing Conference
Clarke J, Gotoh Y & Goetze S (n.d.) Speaker Embedding Informed Audiovisual Active Speaker Detection for Egocentric Recordings. Proceedings of the ... IEEE International Conference on Acoustics, Speech, and Signal Processing / sponsored by the Institute of Electrical and Electronics Engineers Signal Processing Society. ICASSP (Conference)
Khan M, Al Harbi N & Gotoh Y (2012) Natural language descriptions for video streams. V&L Net Workshop. Sheffield, December 2012.
Al Ghamdi M, Zhang L & Gotoh Y (2012) Spatio-temporal SIFT and its application to human action classification. ECCV - VECTaR workshop. Firenze, October 2012.
Al Ghamdi M, Al Harbi N & Gotoh Y (2012) Spatio-temporal video representation with locality-constrained linear coding. ECCV - ARTEMIS workshop. Firenze, October 2012.
Khan M, Zhang L & Gotoh Y (2012) Generating coherent natural language annotations for video streams. ICIP. Orlando, September 2012.
Khan M & Gotoh Y (2012) Natural language descriptions of visual scenes: corpus generation and analysis. EACL workshop. Avignon, April 2012.
Khan M & Gotoh Y (2012) Describing video contents in natural language. EACL workshop. Avignon, April 2012.
Kolluru B & Gotoh Y (2007) Speaker role based structural classification of broadcast news stories. Interspeech 2007 (pp 2593-2596)
Kolluru B & Gotoh Y (2007) Relative evaluation of informativeness in machine generated summaries. Interspeech 2007 (pp 1338-1341)

Working papers

Urban J, Hilaire X, Hopfgartner F, Villa R, Jose JM, Chantamunee S & Gotoh Y (2006) Glasgow University at TRECVID 2006. TRECVID 2006 - Text REtrieval Conference TRECVid Workshop, 363-367.

Grants

Visual Understanding for Fake Imagery Detect, Innovate UK, 09/2021 - 03/2024, £218,226, as Co-PI
Multimedia Analysis for Unsupervised Dubbing In Entertainment (MAUDIE), Innovate UK, 04/2018 - 03/2021, £393,115, as Co-PI
S3L: Statistical Summarization of Spoken Language, EPSRC, 12/2001 - 09/2005, £284,248, as Co-PI

Professional activities and memberships: Member of the Speech and Hearing research group
